Sprite 1984

/ Sprite 1984 - 1993 / Sprite 1984 - 1993.iso / admin / bugs / bugs.archive.1989

Text File | 1990-12-11 | 401.6 KB | 12,353 lines

54.
Subject: access times from backup not being reset
Date: Thu, 16 Feb 89 14:44:35 PST
From: Fred Douglis <douglis>

I have a script that removes binaries from *.old (/sprite/cmds.*.old, etc.). It checks to see if the access time of a file is very old, because I wouldn't want to remove the old version of an installed command if the command was moved to .old only recently. All the files in *.old, and in my home directory, have been accessed in the past 2-3 days late at night. It looks like tar isn't resetting the access times when it dumps something. Since this is something we talked about, and my impression was that it was implemented, I thought I should mention the problem.

55.
Subject: bug: portmap in debugger/nfsmount hung
Date: Mon, 20 Feb 89 14:06:23 PST
From: Fred Douglis <douglis>

I wasn't going through the remote link to nfs properly, and I checked on oregano. portmap was in the debugger. I did a gcore on portmap (output in /tmp/portmap.core) in case anyone wants to look at it. I then found that restarting portmap wasn't sufficient, I had to kill and restart the nfsmount daemon I was interested in. This caused recovery to take place next time I tried accessing /sprite2. However, both before and after the problem occurred, I found that ls was printing "compat: invalid status 0xffffffff", and the only difference is that after I restarted nfsmount, it went on without a hitch even with the error message.

56.
Date: Wed, 22 Feb 89 22:40:33 PST
From: jhh (John H. Hartman)
Subject: 0-length text segments

Whenever paprika compiles a file it produces a garbage object that has a 0-length text segment. We were having this problem all afternoon and it suddenly occurred to me that maybe one machine was sick. I put it into the debugger in case anyone wants to look at it. "Cat"ing one file into another seems to work, so I can't understand why only compiler output gets trashed.

57.
Date: Sat, 25 Feb 89 12:10:34 PST
From: mendel@sprite.Berkeley.EDU (Mendel Rosenblum)
Subject: murder and thyme dance

When I came in today, murder and thyme appeared to be looping sending RPCs between each other.

58.
Date: Mon, 27 Feb 89 08:58:00 PST
From: ouster (John Ousterhout)
Subject: Bug: mace crash (migration-related?)

When I came in this morning and hit the first keystroke, Mace immediately entered the debugger. I got two messages in my syslog window: the first said "Evicting 1 processes", and the second said "Error 1 in SendProcessState" or something like that. Then the machine went into the debugger.

59.
Date: Mon, 27 Feb 89 09:24:57 PST
From: mendel@sprite.Berkeley.EDU (Mendel Rosenblum)
Subject: Bug in brk compatibility

The brk() syscall emulation doesn't shrink the heap segment when given an address less than the current end of heap. This causes programs that use brk() for allocating and freeing memory to grow without bound.

60.
Subject: bug in new gcc?
Date: Thu, 02 Mar 89 12:20:27 PST
From: Fred Douglis <douglis>

Trying to compile loadavg, I now get an error

    loadavg.c:68: initializer for floating value is not a floating constant

This program used to compile just fine, and they sure look like floating constants to me!

61.
Date: Sat, 4 Mar 89 16:24:18 PST
From: ouster (John Ousterhout)
Subject: Gdb bug

If a program being debugged by Gdb exits, gdb prints the message "Program exited normally". But if I then quit from gdb, the process is left around in DEBUG state. Shouldn't gdb clean up this loose end?

62.
Date: Mon, 6 Mar 89 13:50:30 PST
From: ouster (John Ousterhout)
Subject: Bug (lpd can't handle printer death)

It appears that the lpd system is unable to deal with the death of a printer. I made the mistake of turning off my printer in the middle of a long printout, and when I turned the printer on again there was no way to print anything on it. I tried aborting and restarting the printer with lpc, but even that didn't shake things loose.
The only thing I've been able to find that works is to reboot the machine. This seems to be repeatable. Bob, can you take a look? Mace is currently in the hung-printer state, if you have a chance to look at it before I need a printout and reboot.

63.
Date: Tue, 7 Mar 89 16:49:48 PST
From: jhh (John H. Hartman)
Subject: prefix bug

The following prefix input will cause a bus error:

    prefix -x /foo -M /bar

64.
Date: Tue, 7 Mar 89 17:54:16 PST
From: jhh (John H. Hartman)
Subject: another prefix bug

If you do something like

    prefix -x /foo -M /hosts/cayenne/dev/rsd0a

you will put the kernel in the debugger.

65.
Date: Thu, 9 Mar 89 16:43:52 PST
From: mgbaker (Mary Gray Baker)
Subject: xbiff problems?

I'm just wondering if anyone else is still having problems with xbiff giving them the message "XIO: Unknown error."

66.
Date: Mon, 13 Mar 89 09:33:50 PST
From: ouster (John Ousterhout)
Subject: Bug: mail duplication

Has anyone else noticed duplication of mail messages? For example, I got two copies of my last message about large LocalFileIOHandles. I've also noticed this a few times in the recent past, including a message sent to a different distribution list than Sprite (so it can't be just a problem with the sprite distribution list). I'm not sure whether the problem is 100% reproducible.

67.
Subject: bug: /initsprite is not machine-independent
Date: Wed, 15 Mar 89 10:35:16 PST
From: Fred Douglis <douglis>

That is to say, when someone installed a new initsprite on March 8, sun2's stopped booting because initsprite is a sun-3 binary.

68.
Date: Wed, 15 Mar 89 11:17:20 PST
From: gibson (Garth Gibson)
Subject: ls convention irregularity

If a file in the local directory is a symbolic link to another directory, then ls -sF lists it as a directory (suffix is /) (ls -l shows it as a link). This differs from both vax and sun unix (which use the suffix @ for a symbolic link). If the local symbolic link points to a file then sprite conforms with unix (its suffix is @).

69.
Subject: bug: tty should be like unix tty
Date: Wed, 15 Mar 89 16:12:18 PST
From: Fred Douglis <douglis>

In BSD unix, one can say something like "rcp foo:bar `tty`" to copy to the terminal invoking the command. /dev/tty may be used similarly. In sprite, tty is a program to create a terminal driver with a pseudo-device, and /dev/tty doesn't exist. (Before anyone does anything about renaming tty, beware that some scripts may invoke tty. For example, /hosts/pride/bootcmds runs tty on /dev/console to make login use a terminal that understands control characters.)

70.
Date: Thu, 16 Mar 89 11:31:57 PST
From: ouster (John Ousterhout)
Subject: Printing software broken?

I'm no longer able to print on Mace's printer. Lpq prints this:

    Ready and printing.
    Rank    Owner   Job  Files             Total Size
    active  ouster  23   (standard input)  14655 bytes

but nothing happens. Can you take a look? I tried rebooting, thinking the kernel might be wedged, but that didn't solve the problem. I also tried power-cycling the printer; this also didn't help. Then I noticed that psdit seems to be looping infinitely. I tried a few test cases, including files that I KNOW printed a few days ago, but psdit always seems to get into an infinite loop. Printing still works OK for files that aren't coming from ditroff.

71.
Date: Thu, 16 Mar 89 17:48:56 PST
From: mendel@sprite.Berkeley.EDU (Mendel Rosenblum)
Subject: /sprite/spool/mail/mendel corrupted.

My incoming mail file (/sprite/spool/mail/mendel) appears to have been corrupted. I got a message from Susan Eggers that was inserted in the middle of the last message rather than appended to the file.

72.
From: rab (Robert A. Bruce)
Subject: make install bug
Date: Thu, 16 Mar 89 18:28:33 PST

When I run make install in either /a/adobecmds/* or /sprite/src/admin/* make tries to copy the previously installed executable to */sun3.md.old, but can't do it because the sun3.md.old directories don't exist.
So I have to remove the currently installed program before it will install the new one.

73.
Subject: bug: migrating X application hits negative refcount
Date: Fri, 17 Mar 89 13:03:47 PST
From: Fred Douglis <douglis>

for example:

    % xman&
    % sleep a while
    % mig -p <xman_pid>

your host goes into the debugger with a negative write count on the pdev stream. If continued, it will continue to enter the debugger with a complaint about unknown lclpdev. If the process is kill -KILLed on the other host, the home node may be continued without a problem.

74.
Subject: bug: vm pagein/pageout errors and signals == deadlock
Date: Tue, 21 Mar 89 12:32:51 PST
From: Fred Douglis <douglis>

Paprika hit a monitor deadlock when oregano crashed and rebooted. JHH and I chained through the processes and found that the following sequence of events took place: ...

75.
Subject: update -l change
Date: Wed, 22 Mar 89 16:51:59 PST
From: Fred Douglis <douglis>

I tried to install vm but hit a complaint from update about symbolic links. Did someone change kernel.mk recently to make it copy the files referenced to by symbolic links? Anyway, I had to remove the symbolic links before updating the files that had been changed, so it would install new files rather than complaining about the mismatch. (I'm sending mail so no one else wastes time tracking down the same problem.)

furthermore, the complaint by update is misleading: it says that the source file is a real file when the target is a symbolic link, whereas in fact the source file is a symbolic link but it appears as a regular file to update because of the "-l" option.

76.
Date: Wed, 22 Mar 89 21:11:13 PST
From: mendel@sprite.Berkeley.EDU (Mendel Rosenblum)
Subject: bug in reset command

When I type reset from a vt100 terminal I get the message

    Cannot open /usr/lib/tabset/vt100

77.
Date: Thu, 23 Mar 89 10:54:12 PST
From: gibson (Garth Gibson)
Subject: tx

I just used the "~h" command in Mail in a tx shell that was "rsh"'d to pepper.
The To: line contained a lot of names. I heard a set of beeps and then the window became useless. I killed it and started a new one.

garth

78.
Date: Thu, 23 Mar 89 11:13:58 PST
From: gibson@pepper.berkeley.edu (Garth Gibson)
Subject: basil lockup

A few minutes ago basil locked up. It had been running 6 days (that is all I remember about the kernel that was running). I had just completed a mail message in a tx window, rsh'd to pepper when the mouse froze. The spritemon continued, but L1-v etc did not generate output. Brent rsh'd in and found nothing interesting (Xsprite was OK). Finally I did L1-k which did get me control. Then C-c. I suppose I could have started a new X at this time, but instead I rebooted (22 Mar 89 18:19:35) kernel.

garth

79.
Date: Sun, 26 Mar 89 12:55:06 PST
From: brent (Brent Welch)
Subject: mace crash

Mace died inside Mach_MonPutChar when printing a message about "[1] + Segmentation violation Xsprite\n". The error was inside the prom, I think, and was probably an address error of some sort. It ended up panicking three times on the way into the debugger, first from Mach_MonPutChar, then from IdleLoop() because I'll bet that interrupts were off, and then again inside Mach_MonPutChar as it tried again to print an error message.

80.
Date: Tue, 28 Mar 89 09:17:07 PST
From: mendel@sprite.Berkeley.EDU (Mendel Rosenblum)
Subject: oregano and thyme in debugger

When I came in this morning oregano was in the debugger with the message:

    Fatal Error: Fs_RpcStartMigration, unknown lclPdev handle <..>.

and thyme was in the debugger with a bus error.

81.
Date: Tue, 28 Mar 89 17:29:55 PST
From: brent (Brent Welch)
Subject: mint's ipServer died

I think mint's ipServer died today when /sprite was filling. Mint swaps to /sprite and I'll bet the ipServer got a swap error. I've restarted the ipServer and currently there is plenty of disk space.

82.
Date: Wed, 29 Mar 89 09:54:54 PST
From: jhh (John H. Hartman)
X-Mailer: Mail User's Shell (6.4 2/14/89)
Subject: sage and mint dancing

When I came in this morning (somehow I managed to be the first one here), sage and mint were in a recovery dance. Sage would complain about a stale file handle for /sprite/admin/migInfo, they would recover, and then sage would complain again. This went on every few seconds all night from the look of the pile of paper behind mint's console. Mint was complaining that there was no stream associated with the file. Has the recovery code been modified recently?

83.
Subject: Re: mkmf bug (and more file rot)
Date: Wed, 29 Mar 89 13:18:44 PST
From: Fred Douglis <douglis>

yes, that's exactly it. I was going to change it but couldn't check out mkmf.map because its RCS file is garbage. Looks like a line from the migInfo file at the start of the RCS file!! This either means recovery screwed up and let the wrong file get written, or the disk got trashed at some point.

84.
Date: Wed, 29 Mar 89 17:42:26 PST
From: ouster (John Ousterhout)
Message-Id: <8903300142.AA334635@sprite.Berkeley.EDU>
To: sprite
Subject: Bogus messages

Why do I keep getting syslog messages like these?

    Fs_NotifyWriter, bad handle
    Fs_NotifyWriter, bad handle
    Fs_NotifyWriter, bad handle
    Fs_NotifyWriter, bad handle

Is this an over-conservative check that should simply be eliminated?

85.
Subject: bug: oregano fs deadlock
Date: Fri, 31 Mar 89 15:55:01 PST
From: Fred Douglis <douglis>

Oregano hung an rpc for me earlier today, then started wedging things left and right. I was able to debug it for a while before kgdb core dumped on me, then I gave up. The backtrace is in /tmp/oregano.where in case Brent wants to look at it -- it showed at least a couple of processes hung in Pfs stuff.

86.
Date: Fri, 31 Mar 89 16:58:44 PST
From: ouster (John Ousterhout)
Subject: Pseudo-device buffering problem?

Even with the new version of the tty driver, it appears to me that too much buffering is going on in the pdev implementation.
For example, if I rlogin to Sprite using the new rlogind, cat a long file, and then type ^C, an awful lot more characters come out before the ^C takes effect. I tried reducing the size of the pdev buffer and the tty buffer, but this had no noticeable effect on the # of characters that come out before signals take effect.

87.
Date: Sun, 2 Apr 89 17:35:46 PDT
From: jhh (John H. Hartman)
Subject: rlogin problem

If I try to rlogin from unix and I decide not to login I can't kill the login prompt. '^D' doesn't seem to work.

88.
Date: Mon, 3 Apr 89 16:39:09 PDT
From: douglis (Fred Douglis)
Subject: X (tx) bug: window on rebooted host hangs system

I made the mistake of running tx on mint with the display on paprika, then trying to click in the tx window after mint had been rebooted. From that point on, I couldn't get the input focus or do anything else; even xkill said it couldn't grab the mouse, so I couldn't kill the tx window. I finally had to restart X. Seems like we need a way for connections to rebooted hosts to be forcibly destroyed, and for them to time out when appropriate as well.

89.
Subject: bug: "rsh host cmd" hits bus error
Date: Mon, 03 Apr 89 17:52:03 PDT
From: Fred Douglis <douglis>

I can do "rsh xxx" but not "rsh xxx cmd" -- it hits a bus error. Seems the installed rsh is dated November, and there's an uninstalled one dated Mar 24. Can the uninstalled one be installed, so we can debug this problem if it persists? rsh with a command argument worked not too long ago.

90.
Subject: bug: repeating device write
Date: Tue, 04 Apr 89 02:35:59 PDT
From: Fred Douglis <douglis>

Several times today, a host has gotten into a funny situation in which it repeatedly wrote the same line someplace as the result of a single write operation. The first time, paprika's syslog printed the same SU message repeatedly, and Mendel and I looked at it but couldn't track down the problem, and it cleared itself up after we resumed.
The second time, I believe it was oregano with the problem (also syslog), and the third time it was an rlogin from thyme to murder where the same line from a process running on murder kept getting written over and over. I threw thyme into the debugger on general principles, but I'm leaving now, so I don't know if this can be looked into. I'm reporting the bug so people know to be on the lookout, and maybe we can debug it sometime under more reasonable circumstances.

91.
Date: Wed, 5 Apr 89 12:13:28 PDT
From: douglis (Fred Douglis)
Subject: bug: device recovery

I had been catting /hosts/nutmeg/dev/syslog earlier, then after a reboot I got Recovery failed <1> (as usual) but this time hit subsequent errors:

    [thyme]/sprite/users/douglis (5)% cat /hosts/nutmeg/dev/syslog
    cat: read error: stale remote file handle
    [thyme]/sprite/users/douglis (6)% !!
    cat /hosts/nutmeg/dev/syslog
    /hosts/nutmeg/dev/syslog: invalid argument
    [thyme]/sprite/users/douglis (7)% !!
    cat /hosts/nutmeg/dev/syslog
    /hosts/nutmeg/dev/syslog: invalid argument

92.
Date: Fri, 7 Apr 89 14:41:12 PDT
From: douglis (Fred Douglis)
Subject: problem with kmsg?

    %kmsg -v basil
    RecvReply: Error reading socket.
    Debug

any idea what's up? I saw this yesterday too.

93.
From: tve@ernie.Berkeley.EDU (Thorsten Von Eicken)
Date: Sat Apr 8 00:27:09 PDT 1989
Subject: sprite dies in Pdev

I use the Pdev library. I can open the server side of a pdev, but as soon as I receive a client's open request, the server dies and takes the machine with it. I ran my program in the debugger. I get to PdevServiceRequest which calls my open service routine. The flags passed to the service routine look very suspicious:

    (gdb) step
    ServOpen (cd=(ClientData) 0x0, f=(struct Pdev_Stream *) 0x26408, buff=(caddr_t) 0x25078 "\377\377\377\377", flags=4231170, proc=724277, host=13, user=2984, sel= (ClientData) 0xdfdfce8) (comm.c line 247)

in my service routine, I determine I dislike the flags and return with EACCES.
I get back into PdevServiceRequest (without changing the selectBits) which then calls ReplyNoData. The thing then dies in that function (I haven't traced more).

94.
Subject: bug: rlogind infinite loop when userLog locked
Date: Sat, 08 Apr 89 14:46:43 PDT
From: Fred Douglis <douglis>

Symptoms: user rlogins to sprite and exits; never returns to remote host. On sprite, rlogind is in the READY state much of the time. A backtrace showed rlogind in flock. Before calling flock, it sets up an interval timer to send SIGALRM in 10 seconds. gdb claims that the signal handler for SIGALRM is never called. I wound up just copying the userLog to another file and overwriting the original, to break the lock that was causing the problem. At least rlogind will work in the meantime. I'll continue to try to look into the problem. If anyone knows of any recent changes to signals, interval timers, or anything else that might account for this change in behavior, please let me know. (Recent == past few months.)

95.
Subject: bug: rlogin ~^Z incompatible
Date: Sun, 09 Apr 89 17:56:21 PDT
From: Fred Douglis <douglis>

Under unix, my understanding is that ~^Z stops the rlogin without output continuing from it, while ~^Y stops it but lets output continue. Under sprite, ~^Z causes output to continue, which can be pretty annoying....

96.
Date: Tue, 11 Apr 89 20:15:24 PDT
From: mendel@sprite.Berkeley.EDU (Mendel Rosenblum)
Subject: tx bug

Tx jumps into the debugger if you type the following command followed by a carriage return:

    ~brent/bin/read -help

97.
Subject: copy-on-write crashes
Date: Wed, 12 Apr 89 15:22:16 PDT
From: Fred Douglis <douglis>

Paprika has crashed twice in the past two days with the message:

    "COW: numCORPages < 0"

This seems to happen when I fork children from emacs and then the parent emacs process exits. They all share a large address space, which is mostly untouched by the children (they're sitting around doing Fs_Dispatches).
The children are exiting at just about the same time. It's not repeatable, or at least I don't know yet what might make it repeatable. Sorry. If anyone has any interest in pursuing the problem, or has any insight into what could cause it, please let me know. There's a kernel core dump in mendel's tmp directory on /b.

98.
Date: Wed, 12 Apr 89 22:20:38 PDT
From: gibson (Garth Gibson)
Subject: vi problem

I am logged in from home, editing a file on a sprite disk using vi. I wanted to do many instances of a simple change - search for last pattern, repeat last change. Of course, the screen redraw fell way behind. Then everything just stopped. Control C had no effect, neither did ESC or control L. I could ~~^Z back to unix and re-login to basil. Ps said the vi was in RWAIT. I looked at the file and it appeared quite old (I do periodic :w in vi out of paranoid habit). I will blow the process away and redo what I lost.

99.
Date: Wed, 12 Apr 89 22:23:47 PDT
From: gibson (Garth Gibson)
Subject: vi problem revisited

This may be an rlogin problem (overflow input buffer?). When I killed the process, "Killed" was displayed and I got a new prompt, but all keystrokes were still having no effect. I'll kill the login.

garth

100.
Subject: bug: signal deadlock
Date: Thu, 13 Apr 89 11:01:37 PDT
From: Fred Douglis <douglis>

I was running gdb under tx when I decided to restart the debuggee. The tx window went dead (no input, no menu highlighting, whatever). When I tried running programs from other windows, one by one they completed but didn't return to the shell. An l1-p showed they were in the exiting state, and it showed that there was a Proc_ServerProc and a csh waiting on the sig monitor lock. I couldn't find any other processes waiting on static locks (things I could find in an nm listing).

101.
Subject: interval timer bug (rlogind)
Date: Fri, 14 Apr 89 00:48:20 PDT
From: Fred Douglis <douglis>

I noticed that the rlogind hanging bug had returned.
I poked around in the kernel and discovered that the reason rlogind was ready so often, rather than waiting forever, was that it was getting signalled every 20 microseconds. This was due to a bug in procTimer.c that set an interval of <0,0> to <0,20> -- it would be correct to set 1 microsecond to 20 (the minimum timer resolution), but not 0, which indicates the timer should only be hit once.

102.
Date: Sat, 15 Apr 89 17:40:43 PDT
From: mendel@sprite.Berkeley.EDU (Mendel Rosenblum)
Subject: file system deadlock bug

Sprite deadlocks when you try and umount a disk with the prefix command:

    prefix -U /local

The deadlock is as follows: Fs_Command calls Fs_PrefixClear which grabs the prefixLock monitor lock. Fs_PrefixClear calls FsPrefixHandleClose which also grabs the prefixLock monitor lock.

103.
Subject: bug: recovery affects pdev access times
Date: Tue, 18 Apr 89 15:18:29 PDT
From: Fred Douglis <douglis>

When oregano rebooted a few minutes ago, apparently every active rlogin pseudo-device got reset. Therefore, a finger on sprite lists 5 rlogin connections as having identical idle times (40 minutes or so, which is when oregano rebooted) and the only rlogins with different idle times are those that have been active in the past 40 minutes.

104.
Subject: recovery bug
Date: Mon, 24 Apr 89 12:56:43 PDT
From: Fred Douglis <douglis>

Paprika has been going through the following recovery loop for a while: it finds out mace is up, it finds some locked handles and prints GetNextHandle skipping this that and the other thing, it tries to recover something with mace and gets a timeout, and decides mint is dead:

105.
Subject: bug: null object file
Date: Mon, 24 Apr 89 15:57:24 PDT
From: Fred Douglis <douglis>

I just did a compilation and wound up with a .o file full of nulls. No idea whether it was done locally or via migration, or what might have caused this bizarre behavior.
I compiled everything in a directory and the others are apparently okay (at least ld complained only about the next-to-last one it looked at). I'd be interested in hearing if anyone else notices this sort of behavior. Also, I looked very briefly in the sprite log to see if this had been reported before -- it seems slightly familiar -- but I couldn't find anything under some obvious keywords.

106.
Date: Mon, 24 Apr 89 17:47:34 PDT
From: jhh (John H. Hartman)
Subject: mx bug

I typed "ESC F" (goto search string and delete what's there) and the entire mx window died with the following error:

    thyme<jhh 333> X Error: parameter mismatch
      Request Major code 42
      Request Minor code
      ResourceID 0xb00079
      Error Serial #905
      Current Serial #905

107.
Subject: sendmail bug: mail stuck in queue
Date: Fri, 28 Apr 89 16:14:07 PDT
From: Fred Douglis <douglis>

Mail to *.dec.com is apparently getting stuck in the mail queue. I confirmed with Mike that mail to mnelson%decwrl.dec.com@ginger got through, though mail from sprite is not. No reason why so far -- I haven't debugged sendmail -- but you might want to redirect your mail via a unix machine for the time being.

108.
Subject: mint's ipserver died / disk full msgs
Date: Sat, 29 Apr 89 11:18:56 PDT
From: Fred Douglis <douglis>

at 2:40am murder rebooted and mint printed out many messages about domain alloc failed. at the end, the printer wasn't keeping up, so messages were lost, possibly saying something about the ipserver, so I couldn't find out why the ipserver disappeared. The half-hourly message was printed at 3am, and immediately after that inetd complained about select errors and exited. I couldn't check ip.out because somewhere along the line "/hosts/mint/restartservers" got changed to overwrite ip.out rather than append to it, and the old version was lost before I got back downstairs to look at it.
I don't know what to do about the ipserver's random skittishness, but I do have a suggestion about the console message problem: can the "Domain Alloc Failed" message be counted (and have a message about which domain it's talking about), so if the same message comes up many times, it only gets printed once before the domain empties again?

109.
Date: Sat, 29 Apr 89 12:11:37 PDT
From: mendel (Mendel Rosenblum)
Subject: bug in fsmake

The file system assumes that the disk label is copied to the first block of each partition. Fsmake doesn't do this.

110.
Subject: fscheck causing extraneous reboots?
Date: Sat, 29 Apr 89 14:22:01 PDT
From: Fred Douglis <douglis>

Is fscheck causing mint to reboot unnecessarily? I went to see why mint was taking so long to reboot (its RPC system wedged after some recovery error mendel had rebooting murder; debugging caused a watchdog reset before anything could be determined). It had rebooted after checking the root even though nothing was printed out about problems with the root. If the data block bitmap being different on disk is the only thing, is it necessary to shut down and reboot? (It didn't even complain about that, but it seemed like the likeliest problem.)

111.
Date: Sat, 29 Apr 89 14:58:22 PDT
From: mendel (Mendel Rosenblum)
Message-Id: <8904292158.AA397601@sprite.Berkeley.EDU>
To: sprite
Subject: bug2 is fscheck

fscheck writes the disk when it finds duplicate blocks even if the -write flag is not specified.

112.
Date: Mon, 1 May 89 12:08:15 PDT
From: jhh (John H. Hartman)
Subject: thyme's ipserver died

My ipserver died with a bus error in malloc(). It looks like it was trying to do a large allocation and the current memory pointer was bad. I don't really know because it wasn't linked with the debugging version of libc. I had a problem with the ipserver dying because its timer callback queue was messed up. My guess is there is a wild pointer somewhere.

113.
Subject: bug: sendmail zeroed memory
Date: Tue, 02 May 89 10:42:35 PDT
From: Fred Douglis <douglis>

Sendmail occasionally goes into the debugger with a bus error trying to dereference a null pointer when rewriting addresses. Turns out some data structures that are normally initialized from the .cf file are all zeroed out. Unfortunately, I still don't have a recreatable test case, but I do know that the bug only seems to appear when sending mail to internet hosts that are probably not in the host table (i.e., a lengthy name server lookup may be required). Also, the sendmail process that hits a bus error is actually the child of a process that initialized the data structures, so it's conceivable (but unlikely) that the bug is in VM rather than in sendmail itself.

114.
Date: Tue, 2 May 89 16:48:10 PDT
From: douglis (Fred Douglis)
Subject: bug: ipServer memory leak?

For the past couple of days, just about any time I've used the internet from paprika (sending mail, printing files, etc) my system would hang up. I checked the ipServer and it had a resident set of almost 2 megs with a total memory image of 5 megs. paprika had been up since sometime over the weekend, I think. other hosts don't show enormous ipServers, but perhaps this is because I use unix X applications talking over TCP to my host, and because I've been printing things on paprika from Unix.

115.
Subject: bug: locked sendmail files
Date: Wed, 03 May 89 11:58:25 PDT
From: Fred Douglis <douglis>

I did a mailq and found a lot of locked files in the queue, dating back to this morning before mint rebooted. Anyone know anything about this?

116.
Subject: bug: "swap down" error
Date: Fri, 05 May 89 09:59:41 PDT
From: Fred Douglis <douglis>

I found that processes migrating to basil were getting stuck -- not running, not killable, nuttin'. I saw Garth wasn't around, so I threw basil into the debugger.
(Sorry, Garth -- when I continued basil, it panicked with a complaint that "current process is nil" -- maybe kgdb didn't continue it properly after I changed processes?) The migrated process was stuck in an unkillable state because "swapDown" was set and it was waiting for someone to notify it that the swap area isn't down. Of course, we all know /c is just fine right now, so basil somehow got fairly confused.

117.
Date: Fri, 5 May 89 23:27:25 PDT
From: jhh (John H. Hartman)
Subject: bugs in malloc()

I ran a user level program that tries to malloc a giant piece of memory. Two problems occurred:

1) The call in MemChunkAlloc to sbrk failed (correctly) but MemChunkAlloc called panic. Shouldn't malloc return 0 rather than terminate the process?

2) Panic calls fprintf, which eventually calls StdioFileWriteProc. Since nothing has been written to stderr yet, StdioFileWriteProc calls (you guessed it) malloc to allocate a buffer. This is very bad. Stderr should not be buffered.

If, however, we get rid of the call to panic both of these problems go away. Any comments?

118.
Subject: bug: fs consistency hanging
Date: Mon, 08 May 89 12:25:30 PDT
From: Fred Douglis <douglis>

Andreas reported that he couldn't get a login working, and it turned out that opens and stats of "~stolcke/.cshrc" were hanging. I debugged mint and found that everyone was waiting on a CONSIST_IN_PROGRESS that didn't seem to exist (I didn't find anyone actually in the middle of consistency). When I went to reboot mint, I saw something in its syslog about consistency with fenugreek for this file timing out, so it looks like somehow the flag didn't get reset properly. Furthermore, fenugreek was getting lots of timeouts followed by "fenugreek is up", which implies that maybe fenugreek's channels for communication with mint were hung up, perhaps by all the pdev-related operations Andreas was doing. (Mary is seeing many pdev-related syslog messages during recovery).

119.
Date: Sun, 7 May 89 11:58:18 PDT
From: douglis@rosemary.Berkeley.EDU (Fred Douglis)
Subject: bug: pfs callback deadlocked oregano

Oregano locked up sometime late yesterday or early today, with just about every process blocking on the prefixLock. Thanks to John's lock information, I was able to figure out which process was actually holding the prefix lock (a good case for leaving this information available *all the time*). Someone called Fs_PrefixDump, which locked the prefix monitor and called FsDomainInfo. This called FsPfsGetAttrPath, which called FsPseudoStreamLookup, which called RequestResponse. In the meantime, there were various other things, like reopens, going on. I couldn't figure out how to save my gdb window once it was going, so I can't provide a full backtrace, but the gist was that someone was trying to reopen a file and was blocked because the handle was locked; someone else was trying to delete something and was blocked on the handle, etc. I didn't pay much attention to this once I found the prefix table lock held down during the callback.

One more bug, while I'm at it: saying "boot" without any arguments just hangs on oregano, and booting from ginger results in it shutting down and rebooting unsuccessfully from its local disk if there's an error on the root.

120.
Subject: bug: missed notification on packet output?
Date: Tue, 09 May 89 10:07:22 PDT
From: Fred Douglis <douglis>

Wei reported that a migrated process got wedged, and I found that it was stuck doing a remote write to its home machine -- the thing is, it was stuck in the low-level network code, waiting to be told a packet had been output, rather than in the RPC code as I had expected. I called Brent and he's looking at it, but I figured I'd record the bug to make sure it's on the bug list.

121.
Subject: bug: lprm doesn't stop job in process
Date: Tue, 09 May 89 11:16:39 PDT
From: Fred Douglis <douglis>

I accidentally sent a 35-page job to the printer when I meant to select only a page from it. When I did an lprm a moment later, it claimed to remove the job, but it came out nevertheless. I believe that in Unix I have been able to stop jobs even after they have started printing, and certainly before they start printing.

122.
Date: Wed, 10 May 89 23:21:37 PDT
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: bug in DevNet_FsOpen

Perhaps Mendel's new dev implementation fixes this, but I thought I'd better report it anyway. DevNet_FsOpen calls malloc with a semaphore held (protoMutex). If vmMonitorLock is held you go into the idle loop with the interrupts off. I guess this doesn't usually happen on a sun, but it just did on the spur.

123.
Date: Fri, 12 May 89 08:09:39 PDT
From: douglis (Fred Douglis)
Subject: migInfo file locked again (bug)

something must have hung, been suspended, or been thrown into the debugger with the lock to the migInfo file held, because Wei sent me mail last night commenting that after relinking with the fixed node selection code, the time to select an idle host went to 10 seconds! I looked around for obvious candidates, didn't find any, and instead copied the file back to itself and restarted as many loadavg daemons as I could. Another case for using a server-based model instead of a single shared file, as far as I'm concerned.

124.
Date: Fri, 12 May 89 08:12:30 PDT
From: douglis (Fred Douglis)
Subject: bug: can't rlogin to mustard

When restarting all the daemons, I found I couldn't rlogin to mustard. migrating to it works fine and lets me list the running processes, which include ipServer and inetd. Any ideas? It will be listed as "down" until someone kills the old loadavg and starts a new "loadavg -dv" process.
125.
Subject: bug: murder power-on-reset
Date: Fri, 12 May 89 16:59:38 PDT
From: Fred Douglis <douglis>

Murder bit the big one earlier today when its ethernet cable popped out and then was reconnected. Is this a software fault or a problem with the hardware??

126.
Subject: bug: repeated obituaries
Date: Mon, 15 May 89 21:26:49 PDT
From: Fred Douglis <douglis>

It's a little distracting to see "mace considered dead" once every minute or two. I can't imagine that the system thinks mace has gone from being alive to being dead, so there must be a bug that's causing it to say mace is considered dead when it's already dead. This may be tied to the fact that someone is probably trying to write over a pseudo-device to mace with some probability, once per minute.

127.
Date: Mon, 15 May 89 21:52:01 PDT
From: pmchen (Peter M. Chen)
Subject: new user report

Bugs:

Before I got X running, I was using the console window:
1) more doesn't work
     TIOCLGET: invalid argument
2) vi doesn't work
     I vi'ed a file, then edited, then ctrl. Z, then foregrounded (%)
     When I foregrounded, the most recent change was gone. Also when I foregrounded, the screen paused until I hit a key.
3) set filec doesn't work (it does under tx and X)

Once I got X running, life was much better. I still had some problems, though:
1) mouse movement is skewed (when I move the mouse vertically up, it goes at about a 10 degree angle to the right).
2) caps lock doesn't work (nor F1)
3) df prints out wrong information for nfs mounted file systems
4) "ls -F" lists symbolic links to directories as directories instead of symbolic links. E.g. ls /spur2/pmchen lists 262@ from unix but 262/ from sprite. This isn't necessarily a problem, but it is different from unix.

Good things about sprite and tx:
1) tx looks nicer, and the fonts can be smaller with seemingly better resolution
2) vi printing response time seems faster under tx than xterm
3) my machine beeps (ctrl.
G), which it never was able to do before (even under raw console)
4) tx is better at cutting and pasting than xterm
5) once you get X running, most things seem to work right away

tx and uwm wish list:
1) xterm lights up the window that you're working in (in the title bar section). Can tx?
2) I'd like to save screen space and get rid of the command window and the "Control Search Selection" window. Why not use (as xterm) ctrl. mouse to get the Control, Search, and Selection?
3) xterm has a menu item to reset the terminal, which tx doesn't. It comes in real handy sometimes.
4) I'd like to be able to dynamically change the title of a tx window
5) I'd like to have deiconify warp

Questions:
1) can I get named pipes (like the unix command mknod)?
2) how do you use xbiff? I see it in ~douglis/cmds.sun3, but I can't make it work
3) is there a proofer (such as xproof)?
4) is there an easy way to exit out of X?
5) are there common places to look for utilities and help without bugging you guys?

128.
Date: Tue, 16 May 89 14:09:30 PDT
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: prefix bug

This may be a feature, but I consider it a bug. Last night oregano would boot as root and export /a, /b, /c. In oregano's /local (which was serving as / ) there weren't any remote links for /a, /b, and /c. Oregano did not complain about this and on oregano I was able to cd to these directories. Other machines couldn't find them and would not boot. It took me a while to figure this one out. I don't think prefix should allow you to export a prefix that doesn't have a remote link. If you just want to change the name of a prefix on a particular machine you can do the same thing with a symbolic link, rather than the prefix command.

129.
Subject: fs bug: bogus type
Date: Wed, 17 May 89 14:42:04 PDT
From: Fred Douglis <douglis>

paprika crashed a short time ago with an address error resulting from Fs_GetAttributes calling a routine based on an invalid type (32).
The core file is in /c/tmp/mendel/vmcore if that would be of use to Brent (please delete if not). Sounds like some checks for bogus types would be useful.

130.
Date: Sun, 21 May 89 22:01:40 PDT
From: douglis (Fred Douglis)
Subject: bug: nawk & gawk incompatible

gawk was installed, and nawk removed, but a script that works with nawk doesn't work with gawk. I believe it's because nawk allows variables to be defined on the command line. Check out ~douglis/bin/KernelVersions for an example of a command that produces no output using gawk.

131.
Subject: bug (sort of): gcc & float
Date: Mon, 22 May 89 00:06:19 PDT
From: Fred Douglis <douglis>

it seems that a number of programs that compile just fine under sunos using the std. cc produce incorrect code under gcc, due to the use of "float" v. "double". does anyone know whether other versions of the C library (pre-ANSI) use floats instead of doubles, or something? Andreas reported that "pic" produced bad code because of this, and now I found that ggraph produced a garbage graph under sprite, and has lots of use of floats. I also am starting to think my trouble with TeX is due to gcc v. whatever everyone else uses.

132.
Subject: bug: non-ready process
Date: Wed, 24 May 89 11:02:14 PDT
From: Fred Douglis <douglis>

paprika just crashed with a "non-ready process in ready queue", followed by a deadlock syncing the disks, followed by a deadlock on sched_Mutex, followed by aborting and requiring a watchdog reset to stop being comatose.

133.
Subject: bug: tftpboot borken again
Date: Thu, 25 May 89 11:27:16 PDT
From: Fred Douglis <douglis>

a kernel that runs fine from unix gets "exception 10" immediately after booting from mint.

134.
Date: Sat, 3 Jun 89 16:09:14 PDT
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: bug in man macros

The .VS macro starts sidebars, but the .VE doesn't seem to stop them. They continue to the end of the document.

135.
Date: Mon, 5 Jun 89 17:48:11 PDT
From: jhh@sprite.Berkeley.EDU (John H.
Hartman)
Subject: migration bug

Just for fun I decided to migrate a load of the spur kernel away from my host while it was running. I typed "mig -p <processid>" and it was migrated to mustard. When the load completed thyme thought the size of the resulting kernel was about 1 Mb, while the rest of the system including the fileserver thought it was its normal 4 Mb. I think the size on thyme is the size of the file at the time the migration occurred.

136.
Date: Fri, 9 Jun 89 15:01:37 PDT
From: douglis (Fred Douglis)
Subject: bug: exclusive access to console when using debugger

I was catting /hosts/sloth/dev/syslog when I tried to attach to sloth using kgdb. I had to interrupt the cat process and reattach in order to get through the attachment procedure. Before that, it just hung indefinitely.

137.
Date: Fri, 9 Jun 89 15:12:51 PDT
From: brent (Brent Welch)
Subject: bug main_ variables

We should clean up how the various main_ variables are declared and set. Now that we have Main_InitVars there is no reason to have:

    char *main_HomeDir = "/";
    /*
     * Flags to modify main's behavior. Can be changed without recompiling
     * by using adb to modify the binary.
     */
    Boolean main_Debug = FALSE;         /* If TRUE then enter the debugger */
    Boolean main_DoProf = FALSE;        /* If TRUE then start profiling */
    Boolean main_DoDumpInit = TRUE;     /* If TRUE then initialize dump routines */
    int main_NumRpcServers = 2;         /* # of rpc servers to create */
    char *main_AltInit = NULL;          /* If non-null then contains name of
                                         * alternate init program to use. */
    Boolean main_AllowNMI = FALSE;      /* If TRUE, allow non-maskable interrupts.*/

like I do in my mainHook.c file

138.
Subject: bug: when the disk fills ...
Date: Tue, 13 Jun 89 11:44:11 PDT
From: Fred Douglis <douglis>

I know this has been brought up in the past, and I thought measures had been taken.
If so, they weren't sufficient: when I filled up /a, my host became entirely unusable because it was printing "domain full" messages as quickly as it could (on the display because the syslog window couldn't keep up), and I couldn't get in to remove anything.

How about associating a bit with each file that says whether it has been unsuccessfully flushed to disk? Each file could be printed out only once that way.

The other thing is, when the disk fills up, the client could try waiting a while before flushing again. If the client can't do anything else in the meantime because its cache is full of dirty data, then it could wait rather than beating on the server while someone on another host tries deleting something.

What ever happened to the idea of checking the available space before filling up the cache? Seems like there must be a better way to handle this, and we should deal with this before we put more people on the system.

139.
Date: Fri, 16 Jun 89 13:19:22 PDT
From: mendel (Mendel Rosenblum)
Subject: bugs in fscheck and boot sequence

During the boot sequence, if the file .fscheck.out does not exist fscheck appears to write its output to the root directory of the file system being checked. The only recovery from this is remaking the file system. Fscheck doesn't appear to be able to fix a disk whose root directory was trashed. Also, the mkdir program should probably be added to /boot/cmds.

140.
Date: Fri, 16 Jun 89 18:18:23 PDT
From: douglis (Fred Douglis)
Subject: bug oregano fscheck loop

yet again, oregano would not reboot. apparently someone started it rebooting around 5:15 this afternoon without notifying anyone else and without sticking around to look at it; John O. and I wandered up there and saw it was rebooting, and left it alone until I decided it wasn't getting anywhere. When I rebooted single-user, there were a few problems (like the $path wasn't set up to execute anything!) but i was able to attach /c and see that lost+found was full again.
I tried creating and deleting lots of files, getting the size of the directory up to 16K and that still wasn't enough. i finally gave up and rebooted with a fastboot, so * /c still has not been checked *. this was after 3 or 4 attempts to get fscheck to complete without filling up lost+found.

141.
Date: Sat, 17 Jun 89 13:27:08 PDT
From: mendel (Mendel Rosenblum)
Subject: bug in fscheck

The -hostID option of fscheck should allow the user to specify the hostID to set in the disk header. I made tonkawa's disk on murder so the hostID was set to 17. When I booted tonkawa it would initialize its hostID from the disk so I couldn't change it. I had to L1-a tonkawa during the boot, set rpc_SpriteID to 15 from the PROM, continue the boot, and run fscheck.

142.
Date: Mon, 19 Jun 89 21:39:06 PDT
From: brent (Brent Welch)
Subject: SendTimerSigFunc bug?

Mendel had complained that the timer queue was filling up in the new kernels. I did some debugging and noticed many entries due to SendTimerSigFunc, which is used for process interval timers. There is a level of indirection that must be followed to see this. SendTimerSigFunc is called from CallFuncFromTimer, which is the function in the timer queue. Anyway, it looks like some process is either way overusing the interval timer stuff, or some recent change has broken it and the timer reschedules itself incorrectly.

143.
Date: Tue, 20 Jun 89 22:49:11 PDT
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: file permissions bug

If I try to chmod /sprite/src/kernel I get:

    thyme-3# chmod 775 /sprite/src/kernel
    chmod: /sprite/src/kernel: too many levels of symbolic links

Also, the file LOCK.make existed in /sprite/src/kernel and was owned by me:

    -rw-rw-r-- 1 jhh 0 Jun 2 14:28 LOCK.make

but I could not delete it:

    rm LOCK.make
    rm: LOCK.make: permission denied

144.
Date: Wed, 21 Jun 89 17:40:18 PDT
From: mendel (Mendel Rosenblum)
Message-Id: <8906220040.AA69899@sprite.Berkeley.EDU>
To: sprite
Subject: prefix bug

I had /sprite/src/kernel attached to murder under both /sprite/src/kernel and /d. If you type

    cd /sprite/src/kernel/dev
    pwd

you get /d/dev as output. This breaks mkmf.

145.
Date: Wed, 21 Jun 89 18:12:19 PDT
From: douglis (Mary Gray Baker)
Message-Id: <8906220112.AA733452@sprite.Berkeley.EDU>
To: sprite
Subject: bug in Vm_FindCode

I got into a mode where any process trying to execute "sh" would hang in an unkillable state. This is because FindCode thinks someone else is already trying to allocate the segment, and it waits on a condition that never gets notified. Seems like this isn't an awfully high priority problem, but something worth thinking about...

146.
Subject: bug with syslog
Date: Thu, 22 Jun 89 12:32:44 PDT
From: Fred Douglis <douglis>

maybe related to the new changes in dev? the newer kernels get screwed up and only direct some output to the process that's catting /dev/syslog, with the rest going directly to the display.

147.
Date: Thu, 22 Jun 89 17:13:35 PDT
From: brent (Brent Welch)
Subject: device reopen bug

I have tested device reopening and it is ready to go, except that there is an obscure bug which I don't want to fix right now. The bug would only show up if you have a write-only stream to a remote syslog device, and the remote host reboots. Upon reopen the syslog device would erroneously be told the client has a read-write, not write-only, stream. This would confuse the syslog device because it is a single-reader device. (To fix this you'd have to close the write-only stream and reboot the server.)

148.
Date: Fri, 23 Jun 89 14:14:13 PDT
From: stolcke (Andreas Stolcke)
Subject: spritemon

When I tried to run spritemon recently on mint it gave me Floating-point exception. I was rlogged in from a non-sprite sun4, but I don't see how that could have something to do with it.

149.
Date: Sat, 24 Jun 89 15:54:44 PDT
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: tx bug

It looks like tx windows are missing some refresh events. If I change the window under a dialog box (like the one that says I can't write the file), and then pick "continue", the underlying window is not refreshed.

150.
Date: Sat, 24 Jun 89 18:16:47 PDT
From: mendel (Mendel Rosenblum)
Message-Id: <8906250116.AA397619@sprite.Berkeley.EDU>
To: sprite
Subject: recovery bug

Every 30 seconds murder prints a message

    6/24/89 17:17:31 basil (5) completed recovery

in its syslog. Murder is running:

    SPRITE VERSION 1.0 (Brent sun3) (23 Jun 89 13:03:36)

and basil is running

    SPRITE VERSION 1.0 (Brent sun3) (14 Jun 89 17:42:58)

151.
Date: Sun, 25 Jun 89 15:07:10 PDT
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: spur vm bug

On line 1476 in Vm_SegmentDup there is an unlock of the page pointed to by the destination PTE ptr. Unfortunately this page was not locked in the first place. Vm_SegmentDup was called by InitUserProc. I looked all through the vm code and was unable to find the place where the destination page is locked. Obviously this can't be the case, otherwise the code would never work. Could someone who understands the code better take a look at it and tell me where the page is locked?

152.
Date: Sun, 25 Jun 89 18:25:28 PDT
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: netroute bug

I can't install a route to tonkawa because rarp fails. I don't know why rosemary refuses to answer rarp requests, but it would be nice if I could specify the internet address of the host to netroute, or have netroute look in spritehosts. Also, what's the deal on the rarp daemon? Are our fileservers supposed to be running it, or do we depend on unix? Raid had a problem booting because no one responded to rarp. When I started the daemon on tonkawa the problem went away.

153.
Date: Thu, 29 Jun 89 17:19:14 PDT
From: ouster (John Ousterhout)
Subject: Pmake bug?
If I type "pmake cleanall" in /a/X/src/cmds/Xsprite, pmake hangs after printing the following information:

    mace: pmake cleanall
    --- cleansun2 ---
    pmake -l 'CC=cc' 'INSTALLDIR=/X/cmds' 'TM=sun3' TM=sun2 clean
    --- tidy ---
    %%% ddx %%%
    --- clean ---
    rm -f sun2.md/spriteBW2.o sun2.md/spriteCG2M.o sun2.md/spriteCursor.o sun2.md/spriteGC.o sun2.md/spriteInit.o sun2.md/spriteIo.o sun2.md/spriteKbd.o sun2.md/spriteMouse.o sun2.md/spriteBW2.po sun2.md/spriteCG2M.po sun2.md/spriteCursor.po sun2.md/spriteGC.po sun2.md/spriteInit.po sun2.md/spriteIo.po sun2.md/spriteKbd.po sun2.md/spriteMouse.po sun2.md/linked.o sun2.md/linked.po *~ sun2.md/*~

Control-C will unwedge Pmake, but the hang seems to be repeatable (i.e. there's no way to get "pmake cleanall" or "pmake clean" to complete).

154.
Date: Thu, 29 Jun 89 18:04:36 PDT
From: douglis@rosemary.Berkeley.EDU (Fred Douglis)
Subject: bug: mint crash in Fs_PrefixDump

Mint crashed with a bus error. because we don't keep sources under unix, i wasn't able to find out much about what was going on other than a backtrace and a local variable list. I dumped *prefixPtr and it was garbage (list pointing to 1 and 4 instead of normal addresses, and so on). This happened right after oregano had its problems with prefix-related operations hanging after the ipServer died.

I rebooted mint with my new kernel, which I will copy over to rosemary as soon as mint comes back. (Mint had been running the JHH kernel, which has who-knows-what in it; my kernel has the installed everything except the new change for the process timer free() bug, which would have eventually crashed mint in any other kernel.)

155.
Subject: bug: ipServer looping?
Date: Thu, 29 Jun 89 23:21:54 PDT
From: Fred Douglis <douglis>

I'm getting pretty awful response when logged in from home, and I noticed that the 5-minute load average is over 1 although there are none of the usual suspects (cc's and whatever) around.
However, the ipServer seems to be in the READY state all or most of the time, at least while I am logged in. Has anyone else noticed this behavior?

156.
Subject: bug: pmake messed up big time
Date: Fri, 30 Jun 89 19:12:00 PDT
From: Fred Douglis <douglis>

see anything funny with this?

    cd /sprite/src/lib/c/mig/
    pmake -k debug
    --- sun3.md/Mig_ConfirmIdle.go ---
    rm -f sun3.md/Mig_ConfirmIdle.go
    cc -O -msun3 -I. -Isun3.md -g -c Mig_ConfirmIdle.c -o sun3.md/Mig_ConfirmIdle.go
    --- ../sun3.md/libc_g.a ---
    ar r ../sun3.md/libc_g.a sun3.md/Mig_ConfirmIdle.go
    ar: filename Mig_ConfirmIdle.go truncated to Mig_ConfirmIdle
    /sprite/cmds.sun3/ranlib ../sun3.md/libc_g.a
    --> rm -rf sun3.md/Mig/sprite/cmds.sun3/ranlib ../sun3.md/libc_g.a
    rm -rf sun3.md/MigAsciiToInternal.go sun3.md/MigGetLocalName.go sun3.md/MigInternalToAscii.go sun3.md/Mig_ConfirmIdle.go sun3.md/Mig_Done.go sun3.md/Mig_GetAllInfo.go sun3.md/Mig_GetIdleNode.go sun3.md/Mig_GetInfo.go sun3.md/Mig_OpenInfo.go sun3.md/Mig_UpdateInfo.go

157.
Date: Sun, 2 Jul 89 18:51:12 PDT
From: mgbaker (Mary Gray Baker)
Subject: redirect bug

If I try to move /tmp/goo to /a/attcmds/csh/sun4.md/csh, the sun3.new kernel crashes with a VmRawAlloc out of memory bug. It is dying in FsLookupRedirect at line 564 with a prefixLength that is total garbage.

158.
From: rab (Robert A. Bruce)
Subject: bug: makedepend
Date: Mon, 03 Jul 89 23:43:44 PDT

makedepend apparently goes into an infinite loop when I run mkmf in /a/newcmds/cc1.68k.

159.
Date: Fri, 7 Jul 89 21:44:37 PDT
From: douglis (Fred Douglis)
Subject: bug: oregano died with leftover indirect block

yet again. it was down for over an hour, including the time needed to check its disks when I rebooted. Mint crashed with the same complaint earlier today.

160.
Date: Sun, 9 Jul 89 22:04:09 PDT
From: brent (Brent Welch)
Subject: pmake sun4 TM bug

pmake on a sun4 doesn't default to TM=sun4 correctly, it defaults to sun3.
However, on the plus side, I was able to compile and install a working rshd from anise for the sun4s.

161.
Date: Mon, 10 Jul 89 13:30:21 PDT
From: douglis (Fred Douglis)
Subject: bug: FsRemoteDomainInfo: waiting for recovery

this should probably time out instead of waiting for recovery. Otherwise, it seems that a down host can cause all operations involving the prefix table to hang indefinitely, including anything one might try to remove the offending entry in the first place.

162.
Date: Mon, 10 Jul 89 18:25:49 PDT
From: mgbaker (Mary Gray Baker)
Subject: assembler bug for sun4 assembling

The sprite (gnu) assembler calls abort() when it sees a load or store instruction to an alternate space. This means I can't assemble most of the sun4 kernel assembly code since it's got a lot of loads and stores to control space, etc.

163.
Date: Tue, 11 Jul 89 18:32:41 PDT
From: mgbaker (Mary Gray Baker)
Subject: ld bug for linking sun4 stuff

The linker gets a segmentation violation when I try to link my sun4 kernel. There could certainly be something wrong with the obj's I'm trying to link, but what the debugger is saying makes no sense.

164.
Subject: bug: ggraph broken
Date: Wed, 12 Jul 89 01:53:58 PDT
From: Fred Douglis <douglis>

the installed version gave me a bizarre line on an input file that generated a good graph on unix. remembering andreas's comment about floats and doubles in gcc, i tried recompiling after changing all floats to doubles, but this time i hit a bus error running ggraph.

165.
Date: Wed, 12 Jul 89 12:43:39 PDT
From: pmchen@sprite.Berkeley.EDU (Peter M. Chen)
Subject: bug report on gettimeofday

I seem to be going backwards in time once in a while. The following is a trace of my program.

    tp.tv_sec=616275411, tp.tv_usec=910000
    tp.tv_sec=616275410, tp.tv_usec=960000

Note that in the last line, tv_sec has gone backwards one second. This seems to be consistent on tv_usec = 960000, but not every time. For example,

166.
Date: Thu, 13 Jul 89 12:30:55 PDT
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: mx bug

My mx window appeared to go into an infinite loop. It was in CharToLine, and the increment was flipping between -1 and 1. There is a core in my home directory named core.91a34 if someone wants to look at it.

167.
Subject: bug? /hosts protections
Date: Thu, 13 Jul 89 15:05:03 PDT
From: Fred Douglis <douglis>

Just about all the /hosts/*.EDU directories are mode 777. Anyone know why this is the case? Makes /hosts/.../nologin a bit of a problem.

168.
Subject: bug: setpriority() not implemented
Date: Thu, 13 Jul 89 17:13:03 PDT
From: Fred Douglis <douglis>

Garth, just a warning if you should use sprite for your simulations. the unix setpriority() call just returns success without doing anything. I think this may be because unix and sprite priorities are implemented differently. In sprite, a priority of "-1" means double all charged usage, while "-2" means quadruple it, and so on. Since unix priorities are linear instead of exponential, someone could have undesired consequences if he used two different unix priorities in one way and in sprite the relative difference was greater.

169.
Date: Thu, 13 Jul 89 23:59:38 PDT
From: mgbaker (Mary Gray Baker)
Subject: pmake hanging bug

Different parts of pmakes keep hanging randomly. If I kill and restart the pmake, it usually goes just fine. Perhaps it has to do with the choice of machines, since when it's restarted it usually gets a different machine.

170.
Date: Fri, 14 Jul 89 12:11:10 PDT
From: pmchen@sprite.Berkeley.EDU (Peter M. Chen)
Message-Id: <8907141911.AA76841@sprite.Berkeley.EDU>
To: /sprite/users/pmchen/mail/sprite/mbox, sprite@sprite.Berkeley.EDU
Subject: su-suspend bug

If I su, suspend the process, then fg it, the su process ends. Am I doing this wrong (ie. do I need to do this differently than on UNIX)?

171.
Date: Fri, 14 Jul 89 13:43:16 PDT
From: mgbaker (Mary Gray Baker)
Subject: Another gcc bug

Gcc seems only to look at the size of a structure before determining whether to use byte, half-word or whole-word loads and stores for structure assignment. This doesn't take into account alignment of the structure. The following code seg faults because it attempts to do half-word loads and stores on an odd boundary.

172.
Date: Fri, 14 Jul 89 14:35:32 PDT
From: mgbaker (Mary Gray Baker)
Message-Id: <8907142135.AA133167@sprite.Berkeley.EDU>
To: sprite
Subject: Gcc alignment bug

Okay, MAYBE this isn't really a bug, but it sure would be nice if things were aligned or at least sized so that they would be aligned. The initialized odd-length string here causes the following initialized structure to be on an odd byte boundary. This happens in about 5 or 6 places in the kernel and causes all sorts of havoc when combined with the gcc bug that does loads and stores based on the size of a structure regardless of its alignment.

173.
Date: Fri, 14 Jul 89 18:23:10 PDT
From: mendel (Mendel Rosenblum)
Subject: printing to lw477 broken

When I try to print to lw477 I get the message:

    <51>Jul 14 18:15:21 lpd[c1139]: lw477: ioctl(TIOCLBIS): invalid argument

but no output.

174.
Subject: race condition bug w/ migration
Date: Fri, 14 Jul 89 19:06:30 PDT
From: Fred Douglis <douglis>

doing many migrations in parallel seems to cause "non-ready process in ready queue" on an infrequent basis. the non-ready process has the state PROC_EXITING but a backtrace indicates it thinks it should be waiting for an RPC. i'll look into this ASAP.

175.
Date: Sun, 16 Jul 89 02:30:35 PDT
From: eklee (Edward K. Lee)
Subject: possible tx geometry bug

Executing "tx =NxM+X+Y" results in a tx window with only M-1 rows. However, executing "geometry =NxM+X+Y" from tx does give you M rows.

176.
Subject: bug: a.out.c out of date?
Date: Mon, 17 Jul 89 11:23:41 PDT
From: Fred Douglis <douglis>

There are references to Aout_PageSize that appear to subscript into the array based on M_SPARC while Aout_PageSize is only set up for M_68020. The source file is /sprite/src/lib/c/etc/a.out.c.

177.
Subject: bug: full kernel build disk
Date: Mon, 17 Jul 89 17:54:35 PDT
From: Fred Douglis <douglis>

oregano hung up again when Mendel tried to remove something from /sprite/src/kernel and it was full. I was able to free up a large chunk of space without getting hung, somehow -- I removed /sprite/src/kernel/sprite/sun3.{old,23Jun...}.

178.
Subject: bug: tftpd causing lingering kernel lost+found files
Date: Tue, 18 Jul 89 12:03:43 PDT
From: Fred Douglis <douglis>

/sprite/src/kernel had 75 megabytes in lost+found, so I tried to remove the files. They were almost all mgbaker kernels. After removing them, the disk space didn't get reclaimed. I poked around a bit and eventually found that mint has about 20-30 tftpd processes lying around. I think they must have open handles on the sun4 kernel files. Do we have a tftpd maintainer in the house?

179.
Subject: bug: lost+found reference counts
Date: Tue, 18 Jul 89 14:19:38 PDT
From: Fred Douglis <douglis>

some of these are clearly bogus:

    drwxrwxr-x  0 root wheel   8192 Jul  7 15:50 /a/lost+found
    drwxrwxr-x -3 root wheel   8192 Jul  7 15:50 /b/lost+found
    drwxrwxr-x  2 root wheel  16384 Jul  8 18:03 /c/lost+found
    drwxrwxr-x -1 root wheel   8192 Jul 12 17:47 /sprite/lost+found
    drwxrwxr-x  2 root sprite 5

180.
From: rab (Robert A. Bruce)
Subject: read error
Date: Thu, 20 Jul 89 07:02:45 PDT

The dump program crashed last night after getting a read error on the file /sprite/spool/mail/mgbaker. The error occurred at byte offset 51200.

181.
Date: Thu, 20 Jul 89 23:47:35 PDT
From: shirriff (Ken Shirriff)
Subject: Mail got messed up

One of the messages in my mail file seems to have got messed up somehow. For some reason, 12 lines of Tex appeared in my mail file:

182.
Subject: bug: nfs symbolic links incompatible
Date: Thu, 20 Jul 89 23:58:42 PDT
From: Fred Douglis <douglis>

I made a set of symbolic links on /rosemary/spare, running on sprite, and then tried to reference them from dill (running ultrix). It complained they were invalid. rosemary also misbehaved, though in rosemary's case "cat foo" would list the name of the file foo points to, as though it weren't a symbolic link and the contents were being printed. sprite acted like ultrix:

    paprika% ln -s foo bar
    paprika% cat bar
    bar: invalid argument

I removed the links on dill and recreated them running on dill. This time they worked. The resulting links were readable by all hosts. Is this a case of sprite and unix having inconsistent sizes (relating to the trailing null character, maybe)?

183.
Subject: bug with kernel idle time var.
Date: Fri, 21 Jul 89 13:09:19 PDT
From: Fred Douglis <douglis>

there used to be a special check to only update the idle time on keyboard or mouse input. looks like now serialB updates it too, so printing causes eviction.

184.
Subject: bug making libraries
Date: Sat, 22 Jul 89 14:15:35 PDT
From: Fred Douglis <douglis>

I am trying to create libX11.a for the ds3100. When I went into the source directory and did a pmake, it made all the object files but produced a lot of empty "ar r" lines that didn't actually replace the object files or remove them. In some cases they actually were added to the archive, but not usually, and i don't see a pattern explaining why it only happened some times. "pmake -n" listed a bunch of commands to do the actual "ar" commands, but "pmake" by itself did the empty "ar" commands again. I finally broke down and am doing a single "ar ... */ds3100.md/*.o" from the shell.

185.
Date: Sun, 23 Jul 89 11:43:10 PDT
From: mendel (Mendel Rosenblum)
Subject: Can't start X without ipServer

Xsprite jumps into the debugger when it is started and the ipServer is not running. No message is produced, xinit just hangs.

186.
Subject: bug: lpd repeatedly restarting
Date: Sun, 23 Jul 89 20:20:59 PDT
From: Fred Douglis <douglis>

with the new serial line driver, when lw477 ran out of paper, I get messages saying things like

    <54>Jul 23 20:19:29 lpd[50b39]: restarting lw477
    Warning: receiver overrun on serialB
    Warning: receiver overrun on serialB
    Warning: receiver overrun on serialB
    <54>Jul 23 ... lpd[50b39]: restarting lw477

i don't believe i ever saw this behavior using the old kernel.

187.
Date: Fri, 21 Jul 89 09:18:01 PDT
From: ouster (John Ousterhout)
Message-Id: <8907211618.AA138019@sprite.Berkeley.EDU>
To: sprite
Subject: Bug: crash during boot

Mace crashed twice in a row while booting "sun3.ouster" this morning. The crash happened just after messages appeared on the console about initiating recovery, relatively early in the boot process. Here's some information from Kgdb:

Stack:

    #0 0xe0575b0 in Timer_ScheduleRoutine (newElementPtr=(Timer_QueueElement *) 0xe07c090, interval=1) (timerQueue.c line 374)
    #1 0xe04d9da in RpcDaemonWait (queueEntryPtr=(Timer_QueueElement *) 0xe07c090) (rpcDaemon.c line 418)
    #2 0xe04d3f6 in Rpc_Daemon () (rpcDaemon.c line 109)
    #3 0xe0523c0 in Sched_StartKernProc (func=(void (*)()) 0xe04d3b8) (schedule.c line 839)

At this point in the code, itemPtr was 0xffffffff, and I found a bogus element at the end of the timer queue. The contents of the element were:

    (links = (prevPtr = 0xffffffff, nextPtr = 0xffffffff), routine = 0xe04da8a, time = (seconds = 16, microseconds = 330000), clientData = 0xffffffff, processed = 0, interval = 2000)

The "routine" was pointing to Rpc_DaemonWakeup.

188.
Date: Sun, 23 Jul 89 21:48:11 PDT
From: ouster (John Ousterhout)
Subject: Adding a new ds3100

This one is for the bug list: I suggest that we should modify our version of bootp to read /etc/spritehosts, so that it isn't necessary to modify /etc/bootptab whenever new hosts are added.

189.
Date: Mon, 24 Jul 89 13:52:07 PDT From: mendel (Mendel Rosenblum) Subject: bug in fsstat output The Internal fragmentation statistics from the fsstat command are totally bogus. I've fixed the bug in the kernel routine Fs_CheckFragmentation that caused this problem. 190. Date: Mon, 24 Jul 89 18:16:18 PDT From: mendel (Mendel Rosenblum) Subject: bug in timing on ds3100 The csh time command gives bogus numbers on the ds3100 running Sprite. The CPU time is greater than the wall clock time. For example: pride% cat direntires* | awk -f a > /dev/null 186.5u 12.3s 1:27 227% 0+0k 0+0io 0pf+0w 212+5101csw 191. Date: Tue, 25 Jul 89 08:57:22 PDT From: mendel (Mendel Rosenblum) Subject: bug in bootp The bootp daemon goes into an infinite CPU loop if you kill the ipServer. 192. Date: Tue, 25 Jul 89 09:51:26 PDT From: mendel (Mendel Rosenblum) Subject: ipServer on mint died When I came in this morning the ipServer on mint was in the debugger. It died in malloc() with a segmentation fault because the large memory pool free list was corrupted. I couldn't figure out what caused the problem but the memory near the corrupted pointer contained the string "Copyright (C) 1989 Digital Equipment Corporation." I think it might have just choked on this :-) 193. Date: Tue, 25 Jul 89 14:30:31 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: sun3.new broken I tried to boot sun3.new on mint and fscheck failed because it couldn't read /dev/rxy0a. Did something change in the dev module? 194. Subject: differences between ansi C and DEC C Date: Tue, 25 Jul 89 16:15:53 PDT From: Fred Douglis <douglis> I'm running into a lot of trouble porting certain programs to sprite, because the ultrix compiler doesn't understand the same things. For example, in diff, "void *" causes headaches, and I had to put in #ifndef __STD_C__ #define void int #endif /* __STD_C__ */ before the uses of this. Ugh. I couldn't port "file" before for a similar reason (and wound up just copying over the ultrix binary). 195. 
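For the "void *" workaround in report 194 above, a narrower variant can be sketched as follows. This is only an illustration under stated assumptions: it tests `__STDC__`, the macro the ANSI C standard defines for conforming compilers (the report spells its macro `__STD_C__`, and which spelling the local headers actually tested is not known here), and the names `GENERIC_PTR` and `copy_bytes` are hypothetical, not taken from diff or from Sprite. Instead of redefining `void` everywhere, it confines the difference to one typedef.

```c
#include <assert.h>

/* Hypothetical name: a generic-pointer type chosen per compiler. */
#if defined(__STDC__) || defined(__cplusplus)
typedef void *GENERIC_PTR;      /* ANSI C: "void *" is understood */
#else
typedef char *GENERIC_PTR;      /* older compilers: fall back to "char *" */
#endif

/*
 * Copy n bytes from src to dst through the generic pointer type.
 * Hypothetical helper, written only to show GENERIC_PTR in use; the
 * sketch itself uses ANSI prototypes for brevity.
 */
static GENERIC_PTR copy_bytes(GENERIC_PTR dst, GENERIC_PTR src, unsigned n)
{
    char *d = (char *) dst;
    char *s = (char *) src;

    assert(d != 0 && s != 0);
    while (n-- > 0) {
        *d++ = *s++;
    }
    return dst;
}
```

Code that previously said "void *" would say GENERIC_PTR, so functions declared void are left alone, which the blanket "#define void int" does not do.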
From: jhh@sprite.Berkeley.EDU (John H. Hartman) To: bugs Subject: flock broken flock() doesn't seem to work on sun3.new. It returns with an invalid argument. I don't know what the behavior is on sun3. 196. Date: Wed, 26 Jul 89 18:40:21 PDT From: mgbaker (Mary Gray Baker) Subject: directories getting locked Sometimes when I do a pmake that migrates, it hangs. I can't kill it. Then if I try to do an ls in the same directory, it hangs too and I can't kill it. The directory becomes totally unavailable. This is inconvenient. 197. Subject: bug with ds3100 ar Date: Wed, 26 Jul 89 22:20:05 PDT From: Fred Douglis <douglis> I hit the following: ar r ../ds3100.md/libc.a ds3100.md/MigAsciiToInternal.o ds3100.md/MigGetLocalName.o ds3100.md/MigInternalToAscii.o ds3100.md/Mig_ConfirmIdle.o ds3100.md/Mig_Done.o ds3100.md/Mig_GetAllInfo.o ds3100.md/Mig_GetIdleNode.o ds3100.md/Mig_GetInfo.o ds3100.md/Mig_OpenInfo.o ds3100.md/Mig_UpdateInfo.o ar: Info: filename MigAsciiToInternal.o truncated to MigAsciiToInter ... ar: Warning:ignoring second definition of MigAsciiToInternal defined in archive ... indeed, there are two copies with the same name in there. 198. Subject: bug: ipserver dying hangs console; migrating prefixes Date: Thu, 27 Jul 89 17:46:43 PDT From: Fred Douglis <douglis> this has probably been reported before; maybe we can boost its priority. when mint's ipserver died this afternoon, we were unable to login at the console to kill it and start a new one. we could not migrate to mint because mint was running an old version of migration. finally, jhh suggested that i rlogin to tonkawa and migrate from there. (this worked, but only after i cd'd to /sprite, since "/" on tonkawa is different from "/" on mint, and mint tried to load its prefix table by broadcasting for "/" when it didn't already have a handle for "/" on tonkawa.) anyway, i was able to kill the ipserver once i could find it, and ken restarted mint's servers. 199. From: rab (Robert A. 
Bruce) Subject: problems with /user1 Date: Thu, 27 Jul 89 15:45:35 PDT Martha reported the following problem with /user1: > My sprite account (/user1/zimet) appears to be hosed... > I have been having problems all day with rsh, rcp, etc. > into my directory on sprite. Is this usual? 200. Date: Fri, 28 Jul 89 09:17:17 PDT From: mendel (Mendel Rosenblum) Message-Id: <8907281617.AA528685@sprite.Berkeley.EDU> To: bugs Subject: /user1 unreadable from cory The problem is that the correct netroute command is not being run on allspice. It should run a "netroute -s" before installing /etc/spritehosts into the kernel. I have no idea which of the several bootcmds is getting run. The copies are in /boot/bootcmds /hosts/allspice/bootcmds /allspiceA/hosts/allspice/bootcmds which one should be modified? 201. From: rab (Robert A. Bruce) Subject: allspice out of memory Date: Sat, 29 Jul 89 03:59:58 PDT Allspice ran out of memory while /user1 was being dumped. 202. From: rab (Robert A. Bruce) Subject: trashed file Date: Tue, 01 Aug 89 02:03:32 PDT /sprite/src/lib/c/ctype/isdigit.c was trashed. I moved the file into isdigit.c.trash and restored the RCS'ed version. This is the garbage that was in the file: -------------------------------------------------------------------------------- isdigit(LIB $(LINTLIB) : $(SRCS:M*.c) $(HDRS) MAKELINT d207 4 a210 3 library : $(REGLIB) profile : $(PROFLIB) lint : $(LINTLIB) d212 5 a216 4 -------------------------------------------------------------------------------- 203. Date: Fri, 28 Jul 89 09:21:46 PDT From: mendel (Mendel Rosenblum) Subject: fscheck bug Fscheck on allspice running on partition /user1 produced 504 messages of the form: Block count corrected for file 73341. Is 8 should be 6. ... Block count corrected for file 73366. Is 8 should be 5. And 28 messages of the form: File zimet/X11R3/mit-dist/X11/bitmaps/right_ptrmsk references non-allocated descriptor 12987. File Deleted. ... 
File zimet/X11R3/mit-dist/X11/bitmaps/sipb references non-allocated descriptor 12990. File Deleted. Is something broken here? 204. Subject: bug: random address fault after recovery Date: Fri, 28 Jul 89 09:34:15 PDT From: Fred Douglis <douglis> I found that a window of mine had gone away, though I saw no msg in my syslog to account for it (such as a page fault problem). However, when I tried to restart the program (emacs), it hit a bus error immediately. When I killed the debuggable process and tried again, it worked okay. I have no idea how to repeat this bug, but I thought it would be worth reporting in case it becomes more common (big game). 205. Subject: I want to debug hanging migrations Date: Fri, 28 Jul 89 10:41:57 PDT From: Fred Douglis <douglis> People have become fairly complacent about problems with the system, killing processes and/or rebooting when things break rather than taking the time for someone to investigate the problem in detail. This makes it harder to identify the problems when they arise. At this point, there's one bug in particular that I'd like to ask people to tell me about immediately: if a pmake hangs part-way through, I want to debug the two machines involved and figure out what's going on. If I'm on the system, please come to me rather than killing the pmake. (Spriters: this is related to the bug Mary saw w.r.t. file locks. I didn't see the simple explanation I hoped to see, so I need to look into this the next time it comes up instead.) 206. Date: Fri, 28 Jul 89 14:19:36 PDT From: ouster (John Ousterhout) Subject: DS3100 bug: not enough processes? While beating on Pride to flush out the ipServer bug I created lots of processes. At one point the kernel entered the debugger with the message "Mach_SetupNewState: Out of machine state structs". Sounds like maybe the limit on # of processes and the number of states in Mach don't match. 207. 
Subject: ds3100 bug: WaitForSomething message Date: Fri, 28 Jul 89 15:10:50 PDT From: Fred Douglis <douglis> I keep getting "WaitForSomething(): select: errno=73" blasted on the console of the ds3100, despite having a window catting /dev/syslog. 208. Date: Fri, 28 Jul 89 15:16:04 PDT From: mgbaker (Mary Gray Baker) Subject: tx extra selection stripe Sometimes in tx I get a black stripe to the right of the cursor that won't go away. It looks just like a selection, but it isn't the selection since it stays when I select something elsewhere. Clearing the window, etc, doesn't get rid of it. How do I make it go away? 209. Subject: bug: server deadlocks Date: Fri, 28 Jul 89 16:18:29 PDT From: Fred Douglis <douglis> the time /a filled up, i couldn't get out of a process on kvetching (swapping off of allspice) and i saw a message about a remove RPC to allspice being hung. It's bad enough when a remove on a full disk gets hung, but when a remove on an empty disk on another machine gets hung, something's pretty bad. some of us have also noticed that allspice has had a tendency to hang or crash when mint or oregano dies. any suggestions about what might be causing this interdependency would certainly be appreciated! 210. Subject: bug: can't backtrace user stack in kgdb Date: Fri, 28 Jul 89 16:43:07 PDT From: Fred Douglis <douglis> I am trying to find out why a migrated process is in the WAIT state, but when I do "where" from kgdb it just returns without printing anything, and "i r" prints 0 for all the registers. Seems like the debugging interface is screwed up. This is the Jul24 installed kernel. 211. Date: Fri, 28 Jul 89 19:51:21 PDT From: gibson (Garth Gibson) Message-Id: <8907290251.AA722218@sprite.Berkeley.EDU> To: bugs Subject: ds3100 I've tried to port my simulation code to the 3100s (kvetching). After Fred fixed one problem I ran into this: It appears to go through initialization, including a printf, then it hangs. 
When I run it under dbx and arbitrarily ^C, I get: Interrupt [scalb, :0x408474] swc1 f20,20(sp) (dbx) where > 0 scalb(x = 1.0, N = 54) [0x408474] 1 scalb(x = 1.0, N = 54) ["ds3100.md/support.c":98, 0x40853c] 2 scalb(x = 1.0, N = 54) ["ds3100.md/support.c":98, 0x40853c] 3 scalb(x = 1.0, N = 54) ["ds3100.md/support.c":98, 0x40853c] 4 scalb(x = 1.0, N = 54) ["ds3100.md/support.c":98, 0x40853c] 5 scalb(x = 1.0, N = 54) ["ds3100.md/support.c":98, 0x40853c] and at least 400 more lines identical to the last 5. When I stopped at a particular address and "next"ed forward I get: [2] stopped at [.block2:612 ,0x401144] if( st_time_til_loss.cnt>=iters ) { (dbx) next [.block3:638 ,0x4014ec] for( i=0; i<num_disks; i++ ) { (dbx) next [.block3:639 ,0x401508] disks[i].failed = FALSE; (dbx) next [.block3:640 ,0x40152c] if( init_fail_rate != 0 ) { /* use Brady lifetime distr */ (dbx) next Illegal instruction [.block3:640 +0x1c,0x401548] if( init_fail_rate != 0 ) { /* use Brady lifetime distr */ (dbx) where > 0 .block3 ["reli.c":640, 0x401548] 1 .block2 ["reli.c":640, 0x401548] 2 main(argc = 1, argv = 0x7fdffd0c) ["reli.c":640, 0x401548] and Fred tells me that kvetching's console got a message about "invalid breakpoint". I'm declaring failure for awhile, so I'll copy my code (~gibson/RELI/reli.c) to (~gibson/RELI/reli.c.bug) and leave the executable (same/ds3100.md/RELI). 212. Date: Sat, 29 Jul 89 09:53:11 PDT From: mendel (Mendel Rosenblum) Subject: allspice recovery damages processes on murder Just after allspice recovered last night a Xsprite, tx, and cat /dev/syslog I had running on murder entered the debugger with a segmentation fault. 213. 
Date: Sat, 29 Jul 89 15:12:52 PDT From: gibson (Garth Gibson) Message-Id: <8907292212.AA66852@sprite.Berkeley.EDU> To: bugs Subject: reseting tx when i break out of top in an odd way (in this case, I killed a process from within top and somehow this terminated the top) none of my keystrokes are echoed when this happened on BSD i did a "reset" but reset in tx says "Type tx unknown" using the menu entry "clear and reset window" also fails to turn keystroke echoing back on 214. Date: Sun, 30 Jul 89 15:45:16 PDT From: gibson (Garth Gibson) Subject: nfsmount core leak ? Basil is currently experiencing substantial paging whenever I do anything (ie., in particular copy from nfs to nfs causes > 15 page faults per second and the little copy (24KB) takes more than 10 seconds). Basil is the server for the nfsmount of /spur. It is only an 8MB machine and although I do have 12 windows (10 tx) and 5 rsh's running, but the problem appears to be nfsmount - it is at 4.2 MB. When I do things that involve local execution, nfsmount is paged out; when I do things across nfs, about 2MB are paged in. I killed nfsmount and restarted it and its memory usage was only 184 KB. I did a giant ls -R across nfs and it grew to 312 KB but seemed to stay there. Mendel speculated that this might be a core leak in nfsmount. Does anyone want to run nfsmount for /spur on their machine? 215. Date: Mon, 31 Jul 89 14:23:45 PDT From: deboor (Adam R de Boor) Subject: vi segv I logged in to thyme from envy, so my rows and columns were 0,0. When I did an stty rows 61 (forgetting that columns would be 0) and foregrounded a vi, it complained about screen too large for internal buffer, then died with a segv. It's on the debug queue on thyme (pid e1a49) if anyone wants to look at it. If not, could someone kill it for me :) 216. 
Subject: bug: ds3100 exec.h/a.out.h inconsistency Date: Tue, 01 Aug 89 13:51:11 PDT From: Fred Douglis <douglis> Programs that use a.out.h won't compile for the ds because N_TXTOFF is called with one param in a.out.h but defined to take two params in sys/exec.h. 217. From: jhh@sprite.Berkeley.EDU (John H. Hartman) To: bugs Subject: pmake all does ds3100 I did a "pmake all" on a sun3 and it compiled a completely worthless ds3100 version of the program. 218. Subject: warning to people trying to debug on the ds3100 Date: Wed, 02 Aug 89 17:57:37 PDT From: Fred Douglis <douglis> Mike said something about adding support for debugging, but for the time being, it's often hard to impossible to get a backtrace of a process, depending on how it stops. I removed the mousetrap I had put in loadavg, because calling abort() wouldn't let me look at anything interesting. I also find that emacs locks up on me after I start a sub-process, maybe one time in 20 or 30, and the backtrace after a kill -DEBUG was only one call deep and was probably wrong to boot. 219. Subject: is time going backward?... Date: Wed, 02 Aug 89 19:54:34 PDT From: Fred Douglis <douglis> ... or are user variables getting trashed? finger uses a kernel "idle time" variable that was causing it to get confused. It turned out that kvetching's idle time was -4 seconds. Since this is calculated by doing a Timer_GetTimeOfDay and then doing another Timer_GetTimeOfDay and subtracting the first from the second, a difference of -4 means either that the clock is getting messed up or the loadavg daemon's variables are. given the NaN I've seen, perhaps it's the second, in which case this bug report is nothing new, but I figured it could also be related to the time-flowing-backward bug that Ed reported a while ago. all in all, kvetching's clock seems fairly accurate ("date" coincides pretty well with reality). 220. 
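For report 219 above, the subtraction it describes can be sketched as follows, with the negative case made explicit so it can be logged rather than shown as an idle time of -4 seconds. This is a minimal sketch, not the kernel's or loadavg's code: the type `TimeVal` and the functions `time_subtract` and `idle_seconds` are hypothetical names, and the only thing borrowed from the archive is the seconds/microseconds layout seen in the timer queue dump in report 187.

```c
#include <assert.h>

#define ONE_SECOND_USEC 1000000L

/* Hypothetical stand-in for a time value (seconds + microseconds). */
typedef struct {
    long seconds;
    long microseconds;          /* expected range: 0 .. 999999 */
} TimeVal;

/*
 * result = end - start, borrowing one second when the microseconds
 * underflow.  The result's microseconds stay in 0..999999, so a
 * negative interval shows up as result.seconds < 0.
 */
static TimeVal time_subtract(TimeVal end, TimeVal start)
{
    TimeVal result;

    assert(end.microseconds >= 0 && end.microseconds < ONE_SECOND_USEC);
    assert(start.microseconds >= 0 && start.microseconds < ONE_SECOND_USEC);

    result.seconds = end.seconds - start.seconds;
    result.microseconds = end.microseconds - start.microseconds;
    if (result.microseconds < 0) {
        result.microseconds += ONE_SECOND_USEC;
        result.seconds -= 1;
    }
    return result;
}

/*
 * Idle time in whole seconds.  A negative interval is reported as 0
 * and flagged through *wentBackward so the caller can log it.
 */
static long idle_seconds(TimeVal lastInput, TimeVal now, int *wentBackward)
{
    TimeVal diff = time_subtract(now, lastInput);

    *wentBackward = (diff.seconds < 0);
    return *wentBackward ? 0 : diff.seconds;
}
```

Flagging the backward case does not say whether the clock or the daemon's saved variables are at fault, but it would at least separate the two symptoms in the log.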
Date: Thu, 3 Aug 89 12:07:42 PDT From: douglis@sprite.Berkeley.EDU (Fred Douglis) Subject: bug: header updating must change date If a header file is installed using update, it's possible for object files not to get recompiled because they've been compiled since the date when the header was written, even if they haven't been compiled since the header was installed. This could account for why the debugger still can't backtrace user processes on sun3s, since kgdb sees the wrong version of Mach_UserState. 221. Subject: bug: ds3100 clock Date: Thu, 03 Aug 89 17:31:21 PDT From: Fred Douglis <douglis> it was 5 minutes slow when I checked just now. confirmation that time may occasionally be flowing backwards, given the -4 seconds idle time i saw yesterday. 222. Subject: bug: tx caret disappearing Date: Thu, 03 Aug 89 22:32:12 PDT From: Fred Douglis <douglis> On paprika, when a tx window fills and starts scrolling, the input caret is barely visible at the bottom of the window. on kvetching, the caret disappears entirely, and i must scroll the window up so some blank space appears on the bottom in order to get a caret to appear. this occurs even if i open a window from paprika on kvetching, so it's not the ds3100 tx client (the same sun3 binary produces different results on the two different displays). 223. Date: Fri, 4 Aug 89 08:51:30 PDT From: ouster (John Ousterhout) Subject: Bug: /sprite/users directory weird Something is wrong with /sprite/users, or with du, or with ls. If I cd to /sprite/users and type "du", a bunch of lines appear for a subdirectory "cmds.ancient". Yet if I type "ls" in /sprite/users, no such directory appears, and I cannot cd to /sprite/users/cmds.ancient. This paradox appears to be repeatable, at least for me on Mace. 224. 
Subject: bug: inflated loadavgs Date: Fri, 04 Aug 89 11:17:05 PDT From: Fred Douglis <douglis> At least three hosts right now are listed as having load averages of over 1.0 although there are apparently no processes using up vast amounts of CPU time. I went to murder and l1-r repeatedly and there were never any ready processes. each host is running a different kernel, so it's not like a bug was just introduced. i suspect that the "numReadyProcesses" variable is getting confused but have been unable so far to find out how. If anyone knows of a repeatable case to get machines into this state please let me know. 225. Date: Mon, 7 Aug 89 11:38:47 PDT From: mgbaker (Mary Gray Baker) Subject: cc include path defaults to sun3.md If you compile something for the sun4, without explicitly putting the -I/sprite/lib/include/sun4.md back into its include path in a .mk file, it will pick up header files from /sprite/lib/include/sun3.md. I don't think this is a good idea, since it silently includes the wrong stuff in many cases. Either none of the machine types should have a default include path, or they all should have ones that work. 226. Subject: Mail installed on ds3100 Date: Mon, 07 Aug 89 11:36:27 PDT From: Fred Douglis <douglis> I figured out that Mail wouldn't link because it has some arrays of structures that the dec compiler/loader can't handle. this is really their bug rather than ours, and I am inclined to patch around it temporarily and wait for gcc rather than trying to fix the bug (since we don't have sources anyway). Maybe if Mike wants to pass the problem on to people at DEC, that would be useful? Anyway, the fix was to add "-G 0" to the cc flags so Mail is compiled without using what they call the "global pointer". 227. Subject: ds3100 bug: FPU interrupt in kernel mode Date: Mon, 07 Aug 89 12:56:51 PDT From: Fred Douglis <douglis> kvetching died with this just now. kdbx just kept printing a backtrace of an infinite number of MachFPInterrupt calls. 
Any suggestions of something to look at next time this happens? 228. Date: Mon, 7 Aug 89 13:59:01 PDT From: mgbaker (Mary Gray Baker) Subject: no documentation on malloc tracing Is there no man page describing how to turn on memory tracing of different sorts? You can read the code and piece it together by trial and error, but it sure would be nicer just to read a man page. 229. Date: Mon, 7 Aug 89 22:33:39 PDT From: david@rosemary.Berkeley.EDU (David A. Wood) Subject: /tmp on mace and murder There seems to be a problem with /c on both mace and murder. Since both systems have /tmp linked to /c/tmp, many programs (including mail) don't work. 230. Date: Tue, 8 Aug 89 09:36:15 PDT From: ouster (John Ousterhout) Subject: Piracy in debugger again Piracy has entered the debugger again with the message Bad kernel TLB Fault Entering debugger with a TLB LD miss exeception at PC 0x0 231. Date: Wed, 9 Aug 89 12:04:51 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: migration/rpc problem? I get messages of the following type when I do a pmake: Warning: Proc_RpcRemoteCall: invalid pid: f1a67. The pmake then hangs. I do have a process f1a67: f1a67 MIG 2802 ffffffff sage c2121 sh -ev Any idea what the problem is? 232. Subject: bug: repeated recovery Date: Tue, 08 Aug 89 15:50:04 PDT From: Fred Douglis <douglis> kvetching went into an infinite loop recovering w/ mint. mint's syslog said: 8/8/89 15:47:12 kvetching (2) starting recovery 8/8/89 15:47:15 kvetching (2) completed recovery Fs_RpcIOControl: Stream/handle mis-match Stream <32, 32, 165> => File I/O <32, 0, 1881> kvetching said file 1881 had a stale handle, and then tried again. 233. Subject: ds3100 bug: another recovery problem Date: Wed, 09 Aug 89 15:50:05 PDT From: Fred Douglis <douglis> when oregano rebooted, kvetching started printing "(" over and over on its console. One process claimed to be in the running state, and lots of others were ready. 
An RPC to kill the running process got hung since the rpc daemon couldn't run. I rebooted out of frustration, though I suppose I should have poked around first. 234. Subject: ds3100 bug: XIO reset Date: Wed, 09 Aug 89 18:10:32 PDT From: Fred Douglis <douglis> I occasionally have X windows just disappear. Usually they're my xbiff window or the tx that cats /dev/syslog. I get "XIO: Connection reset by peer" when this happens. Any ideas? 235. Date: Thu, 10 Aug 89 11:36:41 PDT From: douglis@rosemary.Berkeley.EDU (Fred Douglis) Subject: bug: wall and rlogins wall was never fixed to notify remote users. we have a reasonable number of such users, especially Martha, who would appreciate such notification when the world is about to end. also, wall doesn't talk to the cory hosts because /hosts/tonkawa, et al., aren't the real directories. 236. Date: Thu, 10 Aug 89 17:06:27 PDT From: shirriff (Ken Shirriff) Subject: rpn is broken I recompiled rpn and now the octal and hex functions don't work. The problem seems to be due to varargs dropping parameters. Can someone who understands varargs better than I do take a look? The problem is in src/main.c around line 147, where it calls dpyprintf. Then in dpyprintf in dpy/dpy.c, the arguments don't seem to be correct. 237. Date: Thu, 10 Aug 89 18:42:04 PDT From: eklee (Edward K. Lee) Subject: fscmd Sometimes when I execute fscmd -f, I get a message saying "1 locked blocks left". What does this mean? The number of locked blocks seems to accumulate over time. 238. 
Date: Fri, 11 Aug 89 09:50:34 PDT From: ouster (John Ousterhout) Subject: Stale handle warnings I've gotten 3 stale handle warnings this morning: 8/11/89 8:49:13 oregano (38) RmtFile "/tmp//Mx.Re334.1" <3,55891> Write-back failed: stale handle 8/11/89 9:44:36 mint (32) RmtFile "tfAA858935" <1,62649> Write-back failed: stale handle 8/11/89 9:44:41 mint (32) RmtFile "/sprite/spool/mail/douglis" <1,1010> Write-back failed: stale handle I've also gotten 4 "oregano (38) completed recovery" messages this morning, even though neither mace nor oregano has crashed. 239. Date: Fri, 11 Aug 89 10:02:51 PDT From: ouster (John Ousterhout) Message-Id: <8908111702.AA596784@sprite.Berkeley.EDU> To: bugs Subject: Bug: finger timing out on pepper Whenever I run "finger" right now, the following messages appear in my syslog window: <getIOAttr> 8/11/89 9:59:20 pepper (16) RPC timed-out FsRemoteGetIOAttr failed <30002>: device <0,3343505> at server 16 240. Subject: stale handles Date: Fri, 11 Aug 89 10:15:54 PDT From: Fred Douglis <douglis> perhaps this is confirmation that the "stale handle" warnings and trashed files are related. John reported "write-back failed" on my spool file, and twice this morning my mail file has been corrupted (nulls in-between two messages). 241. Date: Fri, 11 Aug 89 11:17:47 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: ds3100 mail problems If I type 'ctrl-C' while composing a mail message I get the standard "Interrupt -- one more to kill letter" message. The second "ctrl-C" doesn't do anything. I have to type "ctrl-Z" and kill the job. 242. Subject: bug: null length symlinks Date: Fri, 11 Aug 89 13:28:01 PDT From: Fred Douglis <douglis> When /c filled up before, the symbolic links I created wound up as 0 length (pointing to nothing). Directories weren't created due to lack of disk space -- couldn't the same logic be applied to symbolic links, rather than creating 0-length links? 
(I don't think the problem affects files, since once space was freed up the files were apparently written okay.) 243. Date: Fri, 11 Aug 89 13:44:46 PDT From: brent (Brent Welch) Subject: zero length symbolic links Indeed, the current implementation of symbolic links has a number of problems, including the feature of creating zero-length symbolic links when the disk is full. That problem can be fixed in Fs_SymLink by removing the link if the Fs_Write fails. However, a more fundamental problem is that the creation of symbolic links should be a "domain dependent" operation instead of being composed of the open, write, and close "domain dependent" operations. (The problem with disk full still has to be addressed with this arrangement.) If we make this change then we'll be able to create symbolic links in NFS domains correctly. (Interestingly, while the NFS protocol has a SYMLINK RPC, it also allows you to create a file of type symbolic link and write a value to it. It's too bad that this works because it means that we can create sprite-like symbolic links in NFS domains. The difference is in the presence (in sprite) of a trailing null.) brent ps. The file servers already guard against zero-length links, so oregano just complained about them. 244. Date: Fri, 11 Aug 89 14:02:34 PDT From: shirriff (Ken Shirriff) Message-Id: <8908112102.AA918313@sprite.Berkeley.EDU> To: bugs Subject: Compiler bug On the sun3, if I cast a double to an unsigned int, I get 0. Casting a float to unsigned int or double to int works. (This is why rpn wasn't working.) 245. Date: Fri, 11 Aug 89 14:56:26 PDT From: brent (Brent Welch) Subject: System failures Mint and Oregano crashed and turned up (at least) three bugs. 1) pwd in a pseudo-file-system isn't fully correct. There is new code to return the prefix associated with an open file, and this crashed Oregano. The pwd was on sage, and the nfsmount was running on Oregano. 
I think the bug is that the shadow stream descriptor on Oregano (the shadow of the stream set up on sage) isn't setup the same as the real stream descriptor on sage, and the code should use the client's information instead of forwarding the operation to the server. If that isn't clear, then don't worry about it, I think I have a handle on it. 2) Mint got an open error on a file in /c/tmp because Oregano was down. It then erased its handle information for the /sprite prefix, oops. Needless to say, this prevented Oregano from completing its boot sequence, and required a restart of mint. I don't know, yet, why mint would do such a thing. It may have been confused by pathname redirection, /tmp => /sprite/tmp => /c/tmp. After getting the error on /c/tmp it wrongly erased information about /sprite instead of /c. 3) After Mint rebooted /tmp was gone. Apparently this has happened before. I suspect something in mint's boot script. 246. Date: Fri, 11 Aug 89 16:51:56 PDT From: shirriff (Ken Shirriff) Subject: kgdb problem I ran into a problem debugging on allspice with kgdb.sun3. The debugger would crash with a segmentation violation when I tried to examine a particular structure. I tried to recompile kgdb.sun3 to help find the problem, but when I try to recompile kgdb.sun3/values.o, cc1.sparc dies and the cc hangs. 247. Date: Fri, 11 Aug 89 17:42:46 PDT From: mendel (Mendel Rosenblum) Subject: sun4 compiler problem When compiling the fstat program for the sun4, gcc generates references to the undefined symbol ___fixunsdfsi. 248. Date: Sat, 12 Aug 89 10:07:11 PDT From: ouster (John Ousterhout) Message-Id: <8908121707.AA793393@sprite.Berkeley.EDU> To: bugs Subject: Bug in finger idle times? I received the following output from finger at about 10:00 this morning: ... Notice that every rlogin-ed connection has an idle time of 3 minutes, even though none of the supposed users is actually here working. 
Furthermore, notice, for example, that Fred's idle time on Allspice is 3 minutes, yet his idle time on Kvetching, the source of the connection to Allspice, is many hours. I checked /hosts/allspice/rlogin*, and two of the files, rlogin1 and rlogin3, really do have last-access times of 9:56 this morning. I suspect that it is no coincidence that Oregano finished a reboot at exactly the claimed last-access time of all these rlogin connections. It appears to me that something related to recovery (device re-open?) is updating the access times when it shouldn't. 249. Subject: sun4 bug: rlogind hung Date: Sun, 13 Aug 89 18:37:14 PDT From: Fred Douglis <douglis> I hit ^C and started typing. I then saw "Fs_Dispatch: stream ID 257 out of range" and my rlogin to allspice hung. 250. Date: Sat, 12 Aug 89 11:06:21 PDT From: mendel (Mendel Rosenblum) Subject: mouse problems on sun4 If you move the mouse on anise while doing a compile you get the message "Warning: receiver overrun on mouse" printed in the syslog and the system acts like you pushed down a mouse button. Many times this causes a uwm menu to appear and then disappear. 251. Date: Sat, 12 Aug 89 11:49:33 PDT From: mendel (Mendel Rosenblum) Message-Id: <8908121849.AA995596@sprite.Berkeley.EDU> To: bugs Subject: malloc on sun4 doesn't align memory correctly Malloc on the sun4 returns objects only aligned to a four byte boundary. This means that mallocing double floating point variables will fail. For example: struct foo { /* other stuff */ double max; /* more other stuff */ } *foo; main() { foo = malloc(sizeof(struct foo)); foo->max = 0.0; } seg faults everytime on Sprite. The large memory allocator appears to align stuff correctly. 252. From: rab (Robert A. Bruce) Subject: allspice crashed Date: Mon, 14 Aug 89 01:08:53 PDT Allspice crashed while /user1 was being dumped. 
Pmeg lists empty Program received signal 16, Interrupt Trap #0 panic (__builtin_va_alist=-167186280) (sysPrintf.c line 188) #1 0xf608f128 in PMEGGet () (sun4.md/vmSun.c line 1329) #2 0xf6090e18 in VmMach_PageValidate () (sun4.md/vmSun.c line 3109) #3 0xf6087678 in VmPageValidateInt () (vmPage.c line 644) #4 0xf6088990 in PreparePage () (vmPage.c line 1657) #5 0xf608848c in Vm_PageIn () (vmPage.c line 1470) #6 0xf600fa80 in testModuloLabel () ERROR: invalid read address 0xcac4 253. Date: Mon, 14 Aug 89 11:59:48 PDT From: brent (Brent Welch) Subject: Allspice crashed, level 15 interrupt Allspice crashed again with a level 15 interrupt error. Mendel says that this means that the cache hit a protection error during a write-back. This is an asynchronous error so we couldn't really figure out the exact details of the problem. We were able to continue allspice, and rlogind ended up in the debugger because it had the affected page. 254. Date: Mon, 14 Aug 89 12:04:07 PDT From: brent (Brent Welch) Subject: mint erased "/sprite" again When allspice crashed mint erased its prefix table entry for "/sprite". I rebooted mint with a new kernel that supposedly guards against this, but it didn't help. I logged in as root and typed "cd sprite" and it immediately printed out "Broadcasting for server of /sprite", oops. This seems repeatable, although I bet that allspice (or oregano) has to be down at the time. By the way, the machines were down for only 1/2 hour today (11:18 to 11:50) during all of this. I'll wait until "after hours" to reenact the problem with mint and /sprite. 255. Date: Mon, 14 Aug 89 15:32:19 PDT From: shirriff (Ken Shirriff) Subject: undefined net routines There are a bunch of routines used in netCode.c and netRoute.c that aren't defined: Net_InetChecksum, Net_InetChecksum2, Net_InetAddrToString, and Net_EtherAddrToString. I can't compile a kernel because these aren't defined, so if anyone knows what the situation is with these, please let me know. 256. 
Subject: bug: lpd broken Date: Mon, 14 Aug 89 16:59:18 PDT From: Fred Douglis <douglis> i saw an error message the last time i booted paprika, and thyme now has 3 lpds in the debugger. someone install a broken version recently? 257. Date: Tue, 15 Aug 89 11:31:16 PDT From: shirriff (Ken Shirriff) Subject: Evil black blob in tx To repeatably create the indestructible black bar in tx that someone reported earlier, click control-left button twice on an opening parenthesis and then clear the window. 258. Date: Tue, 15 Aug 89 11:48:04 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: mx problems My mx window died with the following error: hijack<jhh 291> X Error: bad request code Request Major code 162 Request Minor code ResourceID 0xe0012 Error Serial #1349 Current Serial #1488 I don't know what I was doing at the time -- I think I was trying to scroll up with a lot of stuff selected on the current screen. Also, sometimes when mx starts up on a ds3100 I get just the frame of the window with no contents. It doesn't get filled in for at least 10 seconds, although if I click the mouse in the window it gets filled in immediately. 259. Subject: kgdb bug Date: Tue, 15 Aug 89 12:33:29 PDT From: Fred Douglis <douglis> after reading a new symbol table i was not able to call functions. i had to exit and restart gdb instead. 260. Date: Tue, 15 Aug 89 12:47:48 PDT From: mgbaker (Mary Gray Baker) Subject: compat error message All of a sudden, commands in two windows hung. One was an ls and the other was a "msgs". After about a minute, they both finally said "compat: Cannot decode user status value 0xffffffff" The ls finished, but the messages kept printing it over and over, slowly. My syslog window repeatedly said: RpcDoCall: <open> RPC to oregano is hung <open> RPC exit 0xffffffff 261. Date: Tue, 15 Aug 89 14:51:03 PDT From: brent (Brent Welch) Subject: Mint crash Aug 15 Mint crashed after receiving the wrong reply message from Oregano. 
It hit a bug error in FsSpriteOpen, the client-side RPC stub. The return packet seemed garbled, and in fact it turned out to be the reply packet for a stat RPC, not an open. Oregano was being sluggish in responding to RPCs (a sign of a network interface that needs to be reset), and when mint retransmitted a request Oregano responded with the incorrect reply. Oregano seemed to resend a stat reply with the message ID and command field associated with an open RPC. The bogus reply was followed immediately by the transmission of the good open reply. This means that the scatter gather mechanism in the interface took the RPC header from one packet and the parameter block from another (just a theory). The trace went something like: Open request retransmitted by mint (flags == Qp) Open reply with parameter block from a stat Open reply with good parameter block From kgdb you can dump the RPC trace with (kdbg) print Rpc_PrintTrace(50) From the console keyboard you can reset a Sun3's network interface with L1-n. Before this problem I noticed several complaints from the nfsmount processes on Oregano about RPC timeouts to the NFS server. Anyway, there are a number of possible things to do, beginning with nothing. Beyond that, sanity checks can be added to all RPC stubs, which is probably a good idea, although it will add overhead. Finally, we could periodically reset the Intel ethernet interfaces, which apparently have a reputation for being flaky. Currently the RPC system will do the reset when it receives apparently garbled packets, but that didn't kick in this case. brent ps. This isn't the first time Oregano's ethernet interface has acted up and returned bogus packets to clients. 262. Date: Tue, 15 Aug 89 15:30:22 PDT From: shirriff (Ken Shirriff) Subject: more on ds3100 If I do "more" on a file, then do a search with "/" for something that isn't in the file, I get "Pattern not found" and then "Segmentation violation". 263.
Subject: more unrepeatable ds3100 errors Date: Tue, 15 Aug 89 15:45:36 PDT From: Fred Douglis <douglis> for the record: cd /a/attcmds/more; pmake newtm resulted in the complaint "userMap: undefined variable" and no md.mk file being created. A second mkmf worked fine. cd kernel/fs; rm ds3100.md/*.o; pmake done this morning to 'show off' to bks. all i did was show off how sprite is flaky, because one of the compilations returned with exit status 1 even though no error messages were produced and a second make worked just fine. 264. Date: Tue, 15 Aug 89 19:31:29 PDT From: mendel (Mendel Rosenblum) Subject: bug in sun3/sun4 timer code. The Sprite timer code on the sun3 and sun4 doesn't handle the case of the chip running backwards. This causes the gettimeofday() on the sun3 and sun4 to sometimes run backwards. The chip seems to do two things wrong. 1) The hundredths register sometimes reads out values greater than 99. I have seen values as great as 127 come out. This causes the time returned to be unnormalized because it has 1,000,000 or more microseconds. This seems pretty easy to detect and fix. 2) Other times the hundredths appears to jump forward and settle back again. I've seen the hundredths register go (31, 62, 32, 33) on successive reads. This seems harder to detect and fix. It appears that one cannot trust the timer chip to keep track of time of day at a fine grain. Any suggestions on how to get around this problem? The easiest fix I can think of is to just prohibit time from ever going backwards. 265. From: rab (Robert A. Bruce) Subject: trashed file Date: Tue, 15 Aug 89 20:11:56 PDT /sprite/src/daemons/ipServer/RCS/stat.h,v is trashed. I moved it to /sprite/trashed. I will try and restore the file from a dump tape. 266. Date: Wed, 16 Aug 89 17:24:19 PDT From: shirriff (Ken Shirriff) Subject: ds3100 man problem Running "man command" where command.man is a new man page that hasn't been nroffed yet yields "sh: nroff: not found". 267.
Date: Wed, 16 Aug 89 17:43:38 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: mx selection problem Here is the scenario. I rlogin to a sun3 from a ds3100. I run mx on the sun3 such that it is displayed on my ds3100. I select something not in the mx window. I try to paste it into the mx window. The problem: After a long pause I get: Tried to use selection, but nothing's selected. I can now have two selections on my screen -- one in the mx window and one in any other window. None of the windows will recognize the "enemy" selection. 268. Date: Wed, 16 Aug 89 17:57:36 PDT From: shirriff (Ken Shirriff) Subject: ds3100 dbx dies dbx bombs out on me and leaves me with dbx: internal error: pwait: pid 591408 not found after I set a bunch of breakpoints. The sequence of events is in ~shirriff/dbxbug. 269. From: rab (Robert A. Bruce) Subject: ipServer on allspice Date: Wed, 16 Aug 89 23:09:56 PDT Allspice's ipServer crashed. I tried to debug it, but it died before I could get a stack trace. There was a suspicious message on the console: Intel: spurious interrupt (2) but I don't know if it is related. I put a copy of `restartservers' in /hosts/allspice. 270. From: rab (Robert A. Bruce) Message-Id: <8908170626.AA855356@sprite.Berkeley.EDU> To: bugs Cc: rab Subject: rlogind Date: Wed, 16 Aug 89 23:26:05 PDT I opened an allspice window, set the termcap and then typed `clear'. I got the following message: PdevServiceRequest, bad request magic # 0x31c1113 The window froze up and rlogind was in the debugger. I opened another window, and tried the same thing. It didn't work, so I typed `exit'. The window froze and a second rlogind was in the debugger. 271. Subject: DEFTARGET bug Date: Wed, 16 Aug 89 23:29:46 PDT From: Fred Douglis <douglis> this has been brought up before: TM defaults to sun3 if not set explicitly. I have "TM=$MACHINE" in my PMAKE environment variable and it's worked well for me. John H.
doesn't and he was not able to do mkmf using the modified tm.mk because TM was set explicitly to sun3 even though the target was really "dependall" and TM didn't matter. I'd like to change all references of the form TM ?= @(DEFTARGET) to TM ?= $(MACHINE) any problems with this? 272. Date: Thu, 17 Aug 89 00:22:15 PDT From: shirriff (Ken Shirriff) Message-Id: <8908170722.AA984622@sprite.Berkeley.EDU> To: bugs Subject: ds3100 nroff bug The problem with nroff occurs in the environment saving function caseev. A bunch of variables are defined in ni.c: int block = 0; int ics; int icf; ... etc ... int *hyptr[NHYP] = {0}; ... etc ... Then caseev does read(.., (char *)&block, LENGTH_OF_EVERYTHING), which is supposed to read in all these variables in one fell swoop. However, this assumes the variables are stored consecutively, which they are on the sun. However, on the ds3100, the initialized arrays are put before everything else, so the reads and writes are modifying the wrong variables. 273. Date: Thu, 17 Aug 89 08:30:54 PDT From: ouster (John Ousterhout) Subject: Re: rlogind The rlogind bug Bob reported sounds just like a bug Mike found in the ipServer, where the kernel was reporting more data in the pdev request buffer than was really there, causing the server process to try to handle an extra request. The ipServer also died with a bad magic number. Since Brent was away on vacation, Mike just patched the ipServer to ignore bad requests. I think that the problem is pretty reproducible on ds3100's: just take the patch out of ipServer and try to run X. 274. Date: Thu, 17 Aug 89 09:12:58 PDT From: ouster (John Ousterhout) Subject: Bad News on Dinner It appears that Mint was inaccessible through the network all yesterday afternoon and night. Martha Zimet came by late yesterday afternoon to say she hadn't been able to login to Mint all afternoon. I was able to rlogin from mace, so I didn't look any further. However, this morning she was still unable to login. 
I went upstairs and restarted all Mint's daemons, which fixed the problem. Portmap had been in the debugger. In my haste to get things going for Martha I just killed it. In retrospect I should have taken a look with the debugger.... sorry about that. What is portmap, anyway? Mint was refusing rlogin's and rsh's, but honoring pings and rcp's. There were no network daemons running on Allspice this morning either. I restarted them. By the way, mail apparently wasn't getting through yesterday either: once I restarted the daemons, a flood of day-old internet mail arrived for me. 275. Date: Thu, 17 Aug 89 09:39:40 PDT From: ouster (John Ousterhout) Subject: /tmp disappeared again After Oregano's crash and reboot this morning, /tmp was gone again. I added back the symbolic link to /c/tmp. I'm beginning to suspect that Oregano's boot scripts are responsible for this. 276. Date: Thu, 17 Aug 89 09:44:27 PDT From: mendel (Mendel Rosenblum) Subject: someone broke mkmf on ds3100 When I try to mkmf a directory on a ds3100 I get the message "/sprite/lib/pmake/tm.mk", line 91: Undefined variable "$(" Fatal errors encountered -- cannot continue Sure enough, line 91 of tm.mk is syntax_error: $( I have commented this line out so I can do mkmf. 277. Date: Thu, 17 Aug 89 10:01:19 PDT From: mendel (Mendel Rosenblum) Subject: oregano crash Oregano crashed this morning with a bus error in FsWriteBackDesc(). It looked like FsDomainFetch() must have returned a bad domain pointer. 278. Date: Thu, 17 Aug 89 11:39:50 PDT From: ouster (John Ousterhout) Subject: Pmake lost characters I just did a "pmake install TM=sun3" in kernel/dev, and at the very end of the pmake the following output occurred: ... devTty.c: mv llib-ldev.ln sun3.md/llibrm -f sun3.md/llib-ldev.ln usage: mv [-if] file1 file2 or mv [-if] file/directory ... directory *** Error code 1 pmake: 1 error I reran the pmake, and it then worked OK, producing the following output: ...
devTty.c: mv llib-ldev.ln sun3.md/llib-ldev.ln --- ../Lint/sun3.md/dev.ln --- rm -f ../Lint/sun3.md/dev.ln /sprite/cmds.sun3/cp sun3.md/llib-ldev.ln ../Lint/sun3.md/dev.ln It looks like command lines from two different targets may have gotten scrambled together. As I remember, this is similar to the problems people have been having on the DS3100s, but this particular example was on a Sun-3. 279. Date: Thu, 17 Aug 89 11:40:56 PDT From: brent (Brent Welch) Subject: Re: rlogind There were several rlogind in the DEBUG state. Each one seemed to die in a different spot. gdb also died after looking around a little bit. I think rlogind memory image got trashed, and I suspect the cache-write back problem that allspice had a couple days ago. We continued allspice after a cache write back protection error, and rlogind ended up in the debugger at that time. Perhaps the cached page table for rlogind has a bogus value, so any rlogind will eventually die? There are probably ways to flush segments and check this, but I don't remember them. 280. Date: Thu, 17 Aug 89 12:15:11 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: merge, rcsmerge Neither of these will work on a ds3100 because they depend on /sprite/lib/$TM.md/diff3. This is one of those cases where unix has a shell script front end to the real program. Our diff3 is from GNU and doesn't have the back end. Merge uses the backend directly so we're hosed. I don't see why merge can't go through the front end -- I'll look into it. 281. Date: Thu, 17 Aug 89 14:10:39 PDT From: pmchen (Peter M. Chen) Subject: using news to send mail Doesn't change the machine name to sprite. I guess it doesn't use the same sendmail program. E.g. From: pmchen@mustard.Berkeley.EDU (Peter M. Chen) 282. Date: Thu, 17 Aug 89 14:49:47 PDT From: pmchen (Peter M. 
Chen) Message-Id: <8908172149.AA339000@sprite.Berkeley.EDU> To: bugs Subject: program running when sun4 crashed I was on raid, running a program which started up a lot of processes talking to one disk, and it crashed (see Rich Drewes's soon to be ensuing message, or previous message, depending on who mails first). I was running the following program: mult4 /dev/rsvj00 600000 type/1 size/1 0 20 20 0 0 10 I've run the same program other times without crashing. One stress on the system might be the number of processes (20) forked off. We'll try to repeat the crash...more later 283. Date: Thu, 17 Aug 89 14:57:58 PDT From: drewes (Richard Drewes) Subject: Sun 4 bug hi hi hi, raid, a Sun 4 gets occasional hard crashes that necessitate a power cycle (watchdog reset results in a permanently blank screen). The console error message is: MachPageFault: Bus error in user proc 31e12, PC = 9424, addr = 4 BR Reg 80 Fatal Error: Mem_Free: storage block already free Entering debugger with a Interrupt Trap (16) exception at PC 0xf607e6f0 Peter Chen is sending you the code that generated the error. Another, possibly related error I have encountered is not quite as fatal: it just prints a segmentation fault sometimes when I manipulate large blocks of malloced data (like 100KB). Thanks for your attention, O Sprite God. 284. Date: Thu, 17 Aug 89 19:47:03 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: problem with sun4.md/machConst.h The sun4 version of machConst.h redefines a bunch of sys variables (like SYS_NUM_SYSCALLS). This isn't very convenient for adding new system calls. 285. Subject: rcs check-out error Date: Thu, 17 Aug 89 22:42:22 PDT From: Fred Douglis <douglis> i tried to check out mach/ds3100.md/machAsm.s. I got: co -l machAsm.s RCS/machAsm.s,v --> machAsm.s revision 1.5 (locked) co error: Can't check-out new copy of machAsm.s. Old copy saved. 
Since i was replacing the file anyway, i moved the RCS file to machAsm.s.bak,v and then just recreated machAsm.s with the copy mike sent me. so much for source control... 286. Date: Fri, 18 Aug 89 09:25:52 PDT From: mendel (Mendel Rosenblum) Message-Id: <8908181625.AA332066@sprite.Berkeley.EDU> To: bugs Subject: mkmf defaults problem It used to be that when you typed mkmf and had only one ".md" directory that machine type would be the default. Someone has changed this. Now it sets TM to default to $MACHINE. When I type pmake on the sun3 with only a sun4.md directory I get the following output: murder% pmake --- sun3.md/fs.new.o --- rm -f sun3.md/fs.new.o ld -r -o sun3.md/fs.new.o ld: no input files *** Error code 1 pmake: 1 error I like the way it was before better. 287. Subject: ds3100 crash starting recovery Date: Fri, 18 Aug 89 14:53:48 PDT From: Fred Douglis <douglis> the moment kvetching enabled RPCs it died by jumping to pc 0. This was after printing that it was starting recovery with mint. Looks like it got something bad from mint that it didn't protect itself against when going through its jump tables. 288. Subject: Re: ds3100 crash starting recovery Date: Fri, 18 Aug 89 15:01:02 PDT From: Fred Douglis <douglis> Actually, a more precise description of the bug, now that i realize what happened. I had a "cat /hosts/kvetching/dev/syslog" running on mint in order to tweak recovery when kvetching was down. The crash was repeatable when the cat was running, and kvetching booted just fine once i killed the cat process. 289. Date: Fri, 18 Aug 89 15:43:55 PDT From: ouster (John Ousterhout) Subject: Bug: processes not dying I've been having a lot of trouble lately with processes not dying, either when I type "kill" to gdb, or when gdb exits. About half the time gdb just hangs until I type "killdebug" in another window (thank-you Ken for this convenience). In the past the processes have occasionally not died, but it's never hung gdb like this before. 290.
Date: Sat, 19 Aug 89 10:24:58 PDT From: ouster (John Ousterhout) Subject: Piracy in debugger again I'm beginning to wonder if maybe something is wrong with Piracy, since it ends up in the debugger so much more often than other DS3100's, even though I'm not actually using it. Right now it's in the debugger with the message "Bad kernel TLB Fault Syncing disks ... Entering debugger with a TLB LD miss exception at PC 0x8" 291. Date: Sat, 19 Aug 89 10:53:21 PDT From: gibson (Garth Gibson) Subject: MachTrap in tx on default kernel (Brent sun3) (8 Jul 89 18:49:45) I was running vi in a tx window this morning (on the oldest kernel I can find - the one that generally runs forever) and the tx process took a bus error: MachTrap: Bus error in user proc 4051f, PC = dad4, addr = 2a2f0a84 BR Reg 0 garth 292. Date: Sun, 20 Aug 89 10:39:14 PDT From: brent (Brent Welch) Subject: X on ds3100 I tried to use cardamom today, Sunday. After finally finding /ultrix/cmds/Xmfb I invoked it via xinit. xinit tx -D -title Console -e ~/bin/xstart sprite:0 -- /ultrix/cmds/Xmfb The background pattern appeared for about two seconds and then the screen went blank. I am currently logged into cardamom and see no trace of xinit or Xmfb, but I can't use the screen. Is this a case of not being able to restart X because of an interaction with the ipServer? By the way, what is the one true way of starting X on a ds3100? Why isn't it easy to figure out? Also, the xinit I started was probably /X/cmds.ds3100/xinit, not the one in /ultrix/cmds. 293. Date: Mon, 21 Aug 89 10:09:37 PDT From: mendel (Mendel Rosenblum) Subject: loadavg error messages I've been getting messages of the form: <27>Aug 21 10:07:20 loadavg[11118]: Error evicting foreign processes: an argument to a call was invalid on murder. The kernel is: SPRITE VERSION 1.0 (JohnH sun3) (11 Aug 89 17:57:30) 294.
Date: Mon, 21 Aug 89 10:10:25 PDT From: ouster (John Ousterhout) Subject: Bug in mx regexp search code If you select the last character in a file, enter a garbage string into the search window (one that won't match anything) and type ^B, the regexp code panics with "Pointer error!". 295. Subject: trashed file Date: Mon, 21 Aug 89 10:33:02 PDT From: Fred Douglis <douglis> /user1/douglis/Mail/drafts/1 should have contained a Mail draft that I was trying to save last night when allspice must have crashed. Instead, it contained something that looks like part of an mx log for a file called "versions". I moved it to /user1/trashed/MH-mxlog. 296. Date: Mon, 21 Aug 89 10:55:31 PDT From: pmchen (Peter M. Chen) Subject: lprm dies The printer in our office (508-5), pulla, was not printing, so I tried to lprm a job. lprm -Plw547 (nobody has changed the name of the printer from 547 to 508-5) <jobnumber> returned: *** compat: Invalid message # for Gen module: status = 0x4e22 *** compat: Invalid message # for Gen module: status = 0x4e22 socket: Can't find my hostname Debug This might be because envy is currently down and is returning a weird error message. 297. Subject: tx bug: large selection hung window Date: Mon, 21 Aug 89 11:44:43 PDT From: Fred Douglis <douglis> I tried using ^V to stuff a very large selection, and my tx hung. It's process 70216 on kvetching if someone wants to look at it (I threw it into the debugger). 298. Date: Mon, 21 Aug 89 13:15:01 PDT From: ouster (John Ousterhout) Subject: Bug: xinit needs to be tuned for Sprite From stolcke@icsib8.Berkeley.EDU Mon Aug 21 11:59:04 1989 Received: from icsib.Berkeley.EDU by sprite.Berkeley.EDU (5.59/1.29) id AA08262; Mon, 21 Aug 89 11:59:02 PDT Received: from icsib8. (icsib8.Berkeley.EDU) by icsib.Berkeley.EDU (4.0/ SMI-4.0) id AA00269; Mon, 21 Aug 89 11:59:13 PDT Received: by icsib8. 
(4.0/SMI-4.0) id AA15809; Mon, 21 Aug 89 11:59:08 PDT From: stolcke@icsib8.Berkeley.EDU (Andreas Stolcke) Message-Id: <8908211859.AA15809@icsib8.> To: ouster@sprite.Berkeley.EDU (John Ousterhout) Subject: Re: Anyone use these things? In-Reply-To: Your message of Fri, 18 Aug 89 15:21:57 -0700. <8908182221.AA203572@sprite.Berkeley.EDU> Date: Mon, 21 Aug 89 11:59:06 PDT Yes, xinit is supposed to give a basic X startup. It should also do so when called without any arguments. I think this currently isn't the case on Sprite for two reasons: xinit expects 'X' to be a link to the local X server binary, which is then used as the default server. So in Sprite, 'X' should probably point to 'Xsprite'. xinit invokes 'xterm' as the default terminal emulator client. But since a bunch of options go along with this a link from 'xterm' to 'tx' won't do. Off hand I can think of at least three ways of fixing this: either make the option handling in tx a superset of xterm's, or change the default command line compiled into xinit, or write up a shell script that (sort of) emulates xterm's options calling tx. 299. Subject: access times Date: Mon, 21 Aug 89 13:49:51 PDT From: Fred Douglis <douglis> looks like access times for binaries are updated only sporadically. if i do an ls, and then "ls -lu /bin/ls" it looks like it is getting updated. but if i do some other things and then ls -lu on them, they aren't updated. (a particular example is /sprite/cmds.ds3100/mh/scan). 300. Date: Mon, 21 Aug 89 14:32:47 PDT From: ouster (John Ousterhout) Subject: Can't compile for DS3100 I've been trying to recompile Mx and Tx for the ds3100, but I keep getting messages like "ld: Can't locate file for: -ltcl_g with -B1.31". Does anyone know what this error message means? The file seems to exist in /sprite/lib/ds3100.md/libtcl.a. 301. Date: Mon, 21 Aug 89 15:14:33 PDT From: brent (Brent Welch) Subject: Re: access times of binaries This is the situation that caused some confusion on Fred's part.
If a program is being executed then the system always returns "now" as the current access time. This is done to avoid the overhead of contacting all the hosts that might be executing the program. However, this access time is not propagated back to the file descriptor (bug #1). So, if you use the ls program to look at the access time of the ls program, you'll always get "now". If you use another program, stat for example, you'll get the access time in the file descriptor. A related bug is that (I think) demand loading a file from a remote server might take a path that doesn't update the access time on the binary file. Specifically, FsCacheRead updates the access time, but FsCacheBlockRead does not. Normally FsCacheRead is called on the client and FsCacheBlockRead is called on the server in response to requests for whole blocks. For non-VM accesses FsCacheRead updates the access time at the client cache, and this eventually gets back to the file server. However, VM uses Fs_PageRead and the object-specific BlockRead routines, which do not properly set the access time on the client. 302. Date: Mon, 21 Aug 89 17:43:50 PDT From: mgbaker (Mary Gray Baker) Subject: L1 keys on sun4 A while back there were some complaints about the L1 keys not working at times on the sun4. It turns out that some people's mainHook.c files (not mine, of course, or I would have experienced the same problem!) set main_DoDumpInit to FALSE. If it is false, the routines for the different L1 keys are not initialized. I don't know why this is a variable, or why anyone would want to set it to false, but this explains why various people's sun4 kernels had this trouble. Another complaint is that L1A won't work sometimes. If the machine has wedged itself at a time when interrupts were off, then nothing from the keyboard will work. At that point, you need to watchdog reset it or the equivalent. If a machine does this, this is a bug, since it should not have wedged itself, naturally.
This can happen easily if you are debugging a sun4 kernel and the debugger protocol messes up and it starts timing out. Fixing the debugger will help, and figuring out a way to re-enable keyboard interrupts in the debugger will also help. Both should happen eventually. 303. Date: Mon, 21 Aug 89 21:38:04 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: bug in mx I'm running the new mx (Aug 21 14:34) on a ds3100 running kernel 1.002 and every time I select something and type ^B I get the message "Pointer error!" in the shell that started the mx, and the mx goes into the debugger. 304. Date: Tue, 22 Aug 89 10:17:22 PDT From: mendel (Mendel Rosenblum) Subject: Re: bug in mx > I'm running the new mx (Aug 21 14:34) on a ds3100 running kernel 1.002 > and every time I select something and type ^B I get the message > "Pointer error!" in the shell that started the mx, and the mx goes into > the debugger. Tx does the same thing when you do a meta-b. 305. Date: Tue, 22 Aug 89 12:21:32 PDT From: gibson (Garth Gibson) Message-Id: <8908221921.AA267821@sprite.Berkeley.EDU> To: bugs Subject: ds3100: spritemon spritemon with no args works but: spritemon -ufv%iH 35 Bad user TLB fault in process 31619: pc=401904 addr=4 Segmentation violation 306. Subject: bug: ds3100 tx garbage pointer Date: Tue, 22 Aug 89 12:45:15 PDT From: Fred Douglis <douglis> I was trying to select something and hit the debugger with the following stack. mxwPtr is garbage. 
> 0 CharToLine(mxwPtr = 0x205d676e, position = (...)) ["mxDisplay.c":888, 0x4171d8]
1 MxRedisplayRange(mxwPtr = 0x205d676e, first = (...), last = (...)) ["mxDisplay.c":1270, 0x417ebc]
2 MxHighlightSetRange(hlPtr = 0x1002eb20, first = (...), last = (...)) ["mxHighlight.c":234, 0x414b5c]
3 MxMarkParens(fileInfoPtr = 0x1001b640, position = (...)) ["mxCmdUtils.c":586, 0x413014]
4 .block79 ["mxCmdUtils.c":778, 0x4135d0]
5 MxMouseProc(mxwPtr = 0x10025168, eventPtr = 0x7fdff818) ["mxCmdUtils.c":778, 0x4135d0]
6 Sx_HandleEvent(eventPtr = 0x7fdff818) ["sxDispatch.c":442, 0x42032c]
7 Tx_WindowEventProc(display = 0x10017bb8) ["txWindow.c":1240, 0x403e74]
8 .block249 ["fsDispatch.c":328, 0x44bf40]
9 Fs_Dispatch() ["fsDispatch.c":328, 0x44bf40]
10 .block1 ["tx.c":135, 0x4004cc]
11 main(argc = 9, argv = 0x7fdffa14) ["tx.c":135, 0x4004cc]
307. Date: Tue, 22 Aug 89 12:47:18 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: bug in mv There is a bug moving a file to a symbolic link to itself. For example, I created the file /tmp/foo, and the symbolic link /sprite/tmp/foo -> /tmp/foo. I then did "mv /tmp/foo /sprite/tmp/foo". I get the message "mv: /tmp/foo: rename: invalid argument", and worst of all, the file /tmp/foo disappears. 308. Date: Tue, 22 Aug 89 13:06:39 PDT From: gibson (Garth Gibson) Subject: ds3100 from vi i issued ":e ~/bin/xstart" and it failed with message "/sprite/cmds.sun4/csh" exec format errorNo match 309. Date: Tue, 22 Aug 89 14:06:41 PDT From: gibson (Garth Gibson) Message-Id: <8908222106.AA66877@sprite.Berkeley.EDU> To: bugs Subject: restarting x11 on the ds3100 doesn't work - it goes into a loop waiting for server to start. although once a user is established this is not a problem, every newuser is going to die trying to get his x environment right 310. Subject: Re: ds3100: spritemon Date: Tue, 22 Aug 89 14:10:49 PDT From: Fred Douglis <douglis> I went up to Garth's office to see why spritemon died for him but not for me.
It was dereferencing a null pointer to font info because the font it tried to open didn't exist. It should complain that it can't open the font. Looks like this is actually an X toolkit problem, so I don't know how it would be fixed or by whom.... 311. Subject: nfsmount bug Date: Tue, 22 Aug 89 14:28:37 PDT From: Fred Douglis <douglis> oregano's mount of /chip went into the debugger. the gcore file is in /tmp/nfsmount.core.22628 if anyone wants it. not only did operations on chip hang, but garth said that his Mail process got hung reading mail on sprite. 312. Date: Tue, 22 Aug 89 14:52:16 PDT From: gibson (Garth Gibson) Subject: ds3100: X meta key develops a "lock" mode I don't know how, but I got into a state on the ds3100 in X where meta Press and Release events alternated each time the key was pressed (but didn't when it was released). Caused some funky behaviour when I started typing in a tx window with the meta key locked on! I tore down x (killed the server processes, restarted the servers, restarted x), and it was better. Got to be those alpha particles! garth 313. Date: Tue, 22 Aug 89 21:02:16 PDT From: mgbaker (Mary Gray Baker) Subject: assembler bug When assembling sparc code on a sun3, the assembler gets a bus error if you have a "bnz" instruction. This is a synonym for bne, and it isn't implemented, but this shouldn't cause a bus error. The assembler should report "unknown opcode" or such. 314. Date: Tue, 22 Aug 89 21:14:33 PDT From: gibson (Garth Gibson) Subject: sun3 gdb sometimes gdb on the sun3's hangs when i tell it to kill the program it is debugging. if i kill the process in another window, it proceeds just peachy keen 315. Date: Wed, 23 Aug 89 09:03:54 PDT From: ouster (John Ousterhout) Subject: No warning about disk full? I don't seem to be getting syslog warnings about disk partitions filling up anymore. I do get error returns in programs, such as "Couldn't open "sun4.md/mx": no space left in file system domain."
But cache write-backs don't cause error messages. Is this intentional? I'm not sure it's good. 316. Date: Wed, 23 Aug 89 10:18:35 PDT From: brent (Brent Welch) Subject: Oregano ipServer crash Oregano's ipServer died in CallTimeoutHandler. Its timeoutList seemed ok, but a pointer that it used was bogus, readyPtr. This is an element it plucks from the list, so somehow it got confused. 317. Date: Wed, 23 Aug 89 13:14:25 PDT To: bugs Subject: Crashes leave display off It appears that Sprite doesn't turn on the display when it enters the debugger or exits to the boot ROM. This makes it somewhat harder to figure out what has happened when a machine crashes (Piracy always seems to crash when it's in screen-saver mode). I think this was a problem on Suns too. Seems like it ought to be easy to fix. 318. Subject: mach/ds3100.md/md.mk screwed up Date: Wed, 23 Aug 89 14:04:20 PDT From: Fred Douglis <douglis> All of its sources are for jhh.md instead of ds3100.md. 319. Subject: ds3100 doesn't sync clock Date: Wed, 23 Aug 89 15:32:40 PDT From: Fred Douglis <douglis> if a machine is in the debugger it doesn't increment its time of day clock, nor check it against reality. kvetching is now 15 minutes slow. 320. Date: Wed, 23 Aug 89 16:18:45 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: XCFLAGS funniness The lock tracing stuff wasn't getting turned on in the mem module and I traced it to the XCFLAGS. My kernel.mk file adds -DLOCKREG to XCFLAGS, and the local.mk file in mem adds -DMEM_TRACE. If I go to mem and type 'pmake spur' only the -DMEM_TRACE shows up, but if I type 'pmake TM=spur' they both do. Is there a pmake expert out there who knows why this is happening? 321. Subject: new hosts not being setup properly Date: Wed, 23 Aug 89 17:06:32 PDT From: Fred Douglis <douglis> for example, /hosts/{pepper,parsley,violence}/dev/syslog doesn't exist, and /hosts/pepper/dev doesn't even exist. wall complains. 322. 
Date: Wed, 23 Aug 89 17:13:17 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: NET_NUM_SPRITE_HOSTS is bogus The number of sprite hosts should not be a constant that is compiled into user-level programs. I think this should be obtained through a system call. This would allow us to simply restart programs when we change the maximum, rather than recompiling libc.a and therefore the world. Right now pmake is broken, making it difficult to recompile a new pmake. 323. Date: Wed, 23 Aug 89 17:17:24 PDT From: douglis (Fred Douglis) Subject: host bug Turns out it wasn't the addition of host #50 that broke loadavg, it was the addition of a blank line. I'm changing Host_Next to deal with it, but no one should insert a blank line in spritehosts until this change propagates to every program. 324. Date: Wed, 23 Aug 89 18:03:21 PDT From: brent (Brent Welch) Subject: Oregano deadlock Oregano ran into a deadlock having to do with a call-back to a client during a file remove. I thought there was only one place that is used to wait for call-backs to complete, so I stuck my timeout handler at that spot. Unfortunately this other case slipped past me, so Oregano wedged after it apparently dropped a "consistency completed" RPC from thyme. I'll fix up the code so all client callbacks are guarded with a timeout. By the way, it is still conceivable that this was due to a larger, network-side deadlock problem, especially because Oregano's disks were filling up. brent PS. Mint was rebooted at the same time to get it running the latest sun3.new. Both machines had been up for almost 6 days! 325. Date: Wed, 23 Aug 89 19:34:09 PDT From: brent (Brent Welch) Message-Id: <8908240234.AA336200@sprite.Berkeley.EDU> To: bugs Subject: silent printer errors I still hate the printing system. I often have jobs that abort silently. I don't really care about any of the underlying problems, I just want a user-friendly system. brent ps.
The files /sprite/spool/lpd/lw608-2/{ErrorLog,lw608-2-log} contain no useful information about this particular case. 326. Date: Wed, 23 Aug 89 13:28:31 PDT From: jhh (John H. Hartman) Subject: gdb died on sun4 I was debugging the ipServer on allspice (kernel 1.002) and gdb died with the following message: "MachHandleWindowUnderflow: killing process!". 327. Subject: sun4 bug: allspice misbehaving Date: Thu, 24 Aug 89 10:48:21 PDT From: Fred Douglis <douglis> * allspice's ip server seems to crash much more often than on other machines. * allspice's "rup" entry truly is broken (unlike John's joke about mint & oregano). allspice- sun4 up 61+23:23 0.00 0.00 0.00 (idle 1+08:58:28) not only is the uptime off (which happens typically when rdate fails and the date isn't initialized, so /hosts/`hostname`/boottime is dated 1969, except that allspice's isn't), but allspice's count of migrated processes seems to be non-zero. 328. Subject: sun2 directories Date: Thu, 24 Aug 89 11:05:03 PDT From: Fred Douglis <douglis> someone moved (or removed) sun2.md all over the place, but at least some directories have not been remkmf'ed. So, "make all" generates a lot of complaints. Time for a world remkmf? 329. Date: Thu, 24 Aug 89 12:50:49 PDT From: mgbaker (Mary Gray Baker) Subject: Re: gdb died on sun4 This isn't a bug, although the error message should be improved. This is what happens to user processes that mess up their stacks (unalign them or garbage them) and then get an underflow. It's sort of like a bus error or something, except that your choices of how to handle it inside an underflow trap are very limited. I'll make the error message more informative. This does prove, though, that the watchdog reset is gone, since it used to get a watchdog reset when it tried to print the string. I fixed that problem, which is why you now see this message. 330. Date: Thu, 24 Aug 89 15:14:27 PDT From: eklee (Edward K. 
Lee) Subject: rlogin to cory sometimes hangs Often we cannot rlogin to tonkawa or raid even though the ipServer and inetd are running and the other non-Sprite machines in Cory are accessible. Rlogin daemons are spawned off but seem to immediately enter the DEBUG state. In fact, it seems like two rlogin daemons are spawned for each rlogin attempt. One of the daemons enters the DEBUG state and the other enters a wait state. 331. Date: Thu, 24 Aug 89 15:32:47 PDT From: eklee (Edward K. Lee) Subject: ditroff on ds3100 Running ditroff on a ds3100 results in: Bad user TLB fault in process 22b32: pc=40f6b4 addr=ffff436 being printed to the syslog and the process hanging. 332. Date: Thu, 24 Aug 89 16:39:47 PDT From: brent (Brent Welch) Subject: Attributes and devices Attribute handling is still not perfectly implemented. This summarizes what happens and what ought to change. 1 - If you stat() a file that is being executed, the kernel reports that the access time is "now". This time does not get propagated back to the file descriptor on disk, so the access time can appear to change. 2 - While the device I/O servers maintain an access and modify time, this is not pushed back to the file descriptor. This means that only activity on mint's console will be remembered (maybe). 3 - Clients do not set the access and modify time when a file is created. The file server's time is used. A client does set a modify time when it closes a file, but the server will set the modify time of a write-through (non-cachable) file. The fix to this requires changing the RPC parameters to OPEN and WRITE to include a modify time, and adding an access time to the READ RPC parameters. 333. Date: Thu, 24 Aug 89 16:42:24 PDT From: brent (Brent Welch) Subject: Symbolic link format Sprite adds a null to the end of the file name stored in a symbolic link, while Unix does not. Also, there is no domain-specific SYMLINK operation.
Instead, a symbolic link is created (a la mknod), and then a value is written using the domain-specific WRITE procedure. This means that you can create a Sprite-format symbolic link on a Unix file server via nfsmount, oops. This also means you can create zero-length symbolic links if the disk is full. 334. Date: Thu, 24 Aug 89 16:43:53 PDT From: brent (Brent Welch) Subject: Removes with disk full The file servers do not behave well when the disk fills up. In particular, removes seem to fail, or at least hang. I suspect that the cache gets completely dirty so that indirect blocks cannot be read in, and this hangs the remove which needs to read the indirect block. 335. Date: Thu, 24 Aug 89 16:45:25 PDT From: brent (Brent Welch) Subject: pseudo-device pointer bug On the ds3100 and sun4 machines there are occasional pseudo-device pointer errors. The firstByte index into the request buffer is not pointing to the required magic value. There is probably a bug relating to rounding sizes up to 4-byte boundaries. This is killing ipServer and rlogind processes. 336. Date: Thu, 24 Aug 89 16:46:12 PDT From: brent (Brent Welch) Subject: chmod symbolic link loop chmod 755 /sprite/src/kernel generates the error: too many levels of symbolic link while chmod 755 /sprite/src/kernel/ works ok. 337. Date: Thu, 24 Aug 89 16:55:30 PDT From: jhh (John H. Hartman) Subject: malloc semantics Malloc() should return NULL if more memory cannot be allocated. The current behavior is to kill the process. A variable should be provided that allows a process to make malloc behave in either fashion. 338. Date: Thu, 24 Aug 89 17:20:48 PDT From: brent (Brent Welch) Subject: migration offset The stream offset is probably being screwed up during migration. This can explain the problems with pmake's shell scripts getting apparently garbled. 339. Date: Thu, 24 Aug 89 17:22:18 PDT From: brent (Brent Welch) Subject: mail with no /tmp The previous empty mail message was generated when /tmp was down. 
I'm not sure this is worth trying to fix. However, my mail session looked like: <sage 208> mail bugs Subject: migration offset The stream offset is probably being screwed up during migration. This can explain the problems with pmake's shell scripts getting apparently garbled. 340. Date: Tue, 29 Aug 89 08:31:38 PDT From: ouster (John Ousterhout) Subject: Bug in wall Brent's wall message about Oregano going down did not ever appear on Mace's syslog (but I saw it on Piracy's console). Furthermore, the test wall message yesterday had the same behavior. For some reason wall must be stopping part-way through the list of hosts (an error of some sort?). 341. Date: Tue, 29 Aug 89 08:43:12 PDT From: brent (Brent Welch) Subject: syslog reopening John's message about a bug in wall is really about a bug in reopening /dev/syslog. Mendel noticed yesterday that after he rebooted he couldn't cat /dev/syslog. This was due to a bug in the /dev/syslog reopen procedure. It always thought the device was being reopened for reading, which breaks things because it is a single-reader device. After a reopen, /dev/syslog could never be opened for reading. I've fixed this in my kernel (BW.106) and will install a new dev module. brent ps. (BW.106 is a sun3 kernel) pps. This also works around the bug described in #147 "device reopen bug" 342. Date: Tue, 29 Aug 89 09:31:14 PDT From: Fred Douglis <douglis> Subject: Re: syslog reopening if wall only made it part-way through, it could be related to the hanging rlogin pdev problem I reported yesterday. wall didn't use to try to open rlogins, which was a problem as well. by the way, brent, that explains the problem with murder: wall used to have a bug in which it wouldn't close any of its streams until it exited, so if it hung up in an unkillable state trying to open a pdev it would have references on all the syslogs it ever opened. i already fixed that and it's in the installed version. 343.
Date: Tue, 29 Aug 89 10:42:23 PDT From: ouster (John Ousterhout) Subject: Piracy crash again Piracy crashed just now as I was attempting to rlogin from mace. The console message is: Bad kernel TLB fault Syncing disks. Version: SPRITE VERSION 1.002 (ds3100) (20 Aug 89 18:20:10) Entering debugger with a TLB LD miss exception at PC 0x800ab804 I'll leave the corpse around in case anyone wants to take a look at it. 344. Date: Tue, 29 Aug 89 12:16:17 PDT From: Fred Douglis <douglis> Subject: ds3100 rpn hex broken printing large numbers using rpn prints 7fffffff instead of 8nnnnnnn. BTW, Mike says this is broken under ultrix as well as sprite. 345. Date: Tue, 29 Aug 89 12:32:16 PDT From: Fred Douglis <douglis> Subject: Re: Piracy crash again i debugged it, then talked to mike. unfortunately, he only found what i found: that the tlb fault happened when a load used a register that had a zero value, except that register was the target of an add of non-zero values in the previous instruction. Although the status register indicated interrupts were off, it's just too suspicious. I suggested that we put in a mousetrap to check for mach_NumDisableIntrs > 0 || sys_AtInterruptLevel when taking an interrupt. Mike: have you already made this change, or should I make a stab at it? (I'm a bit worried about using the wrong registers at the wrong time, which is why I ask.) 346. Date: Tue, 29 Aug 89 14:24:37 PDT From: rab (Robert A. Bruce) Subject: swap /sprite/lib/c/net/swap.c is screwed up. If the host machine is little-endian all the byte swapping appears to be correct. But if the host is big-endian the routines all return random garbage off of the stack. For big-endian machines the net swap routines should be macros that perform a nop, but if someone fails to include the header file the routines should still work correctly. A second problem is that there is no RCS file for swap.c. I will fix the swap routines and install the new version. 347.
Date: Tue, 29 Aug 89 14:27:14 PDT From: mgbaker (Mary Gray Baker) Subject: Allspice crash Allspice crashed for the second time with a level 15 interrupt (asynchronous cache write-back error). This is totally disgusting, because the address it was trying to write back to was bogus (in the middle of the hole in the virtual address space). I don't yet even know how you could get an address like that into the cache in the first place. I'm currently investigating this. It's one bit different from a valid address in the intel page, but that's marked as non-cacheable. Anyway, if allspice crashes again with this, and I'm not here, could whoever debugs it please record the value of the global registers for me? Thanks. They contain interesting information on this kind of error. 348. Date: Tue, 29 Aug 89 15:16:16 PDT From: rab (Robert A. Bruce) Subject: readdir There was a bug in readdir() that caused it to swap bytes incorrectly. Because of this, programs that use readdir did not work correctly when accessing disks mounted on little-endian machines. I fixed the bug and I have recompiled `ls', `sh' and `csh', so they work correctly now. Other programs that use readdir still need to be recompiled. 349. Date: Tue, 29 Aug 89 15:31:06 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: thyme won't boot Thyme won't boot because mint won't answer its broadcast for "/". Last Friday I changed the routing between mint and thyme to use the IP protocol in an attempt to debug what was happening over in cory. When I was done I ran 'netroute -f /etc/spritehosts' on mint, but this didn't fix things. 'netroute -p' prints out the correct routing information for thyme. 'rpcstat -trace' shows that mint thought it responded to the request. 'etherfind' does not show mint sending any IP packets. 350. Date: Tue, 29 Aug 89 16:23:06 PDT From: Fred Douglis <douglis> Subject: /tmp trashed?? try doing an "ls /c/tmp" on sun4s i get a segv.
on a ds3100 i get "assertion failed: line 49 of readdir.c". on sun3s i get "dp->d_namlen <= 255" explicitly stated as an assertion failure. time to reboot oregano and check its disks??? 351. Date: Tue, 29 Aug 89 16:25:30 PDT From: brent (Brent Welch) Subject: sun4 compiler The sun4 cc1.space dies with error: ldexp when compiling in my ~brent/idleTime directory. I have had other successes with the sun4 compiler, however, so I encourage people to still try compiling things. brent ps. The file it fails on is print.c 352. Date: Tue, 29 Aug 89 16:25:41 PDT From: Fred Douglis <douglis> Subject: sun4 rdist missing the program doesn't exist. rdist.prog/sun4.md was empty except for md.mk, and when i tried to do mkmf, the makedepend went into an infinite loop. 353. Date: Tue, 29 Aug 89 16:31:00 PDT From: brent (Brent Welch) Subject: Re: ls problems in /tmp A new ls was installed today. It probably can't choke down something in /tmp. I don't think the directory is messed up. Let us debug ls first. 354. Date: Tue, 29 Aug 89 18:04:47 PDT From: mgbaker (Mary Gray Baker) Subject: prof file open bug Did I already report this? The prof module was opening the dump output file without the truncate flag, so crud could be left at the end of the file that gprof would die on. It's been fixed. Next time we do an install, everyone will see the fix. 355. Date: Tue, 29 Aug 89 18:22:50 PDT From: mgbaker (Mary Gray Baker) Subject: tftp daemon problem? I was unable to reboot anise because the tftp daemon wasn't running on mint. There was no daemon in the debugger, though. Would it just exit, or did somebody kill it? 356. Date: Tue, 29 Aug 89 18:30:47 PDT From: mgbaker (Mary Gray Baker) Subject: newly installed sun4 csh broken The newly-installed sun4 csh is broken. It dies when you try to login to a sun4, because it has a bad stack pointer. 
I would back it out, but it appears that whoever installed it overwrote the backup csh in the cmds.old area with the csh that causes ls to die on command completion sometimes. So, I guess I'll move csh to csh.bad and put a copy of the older bad csh in cmds.sun4 as the current csh. This at least will allow you to log in. It's a pity that the person who installed it didn't try it out before installing it. With something as major as csh, this might be a good idea? 357. Date: Tue, 29 Aug 89 18:36:17 PDT From: mgbaker (Mary Gray Baker) Subject: Ugh, it's my fault Well, it seems I've done something totally bizarre to anise. The sun4 csh works just fine on allspice. I'll do some debugging and eventually maybe be able to remove my foot from my mouth. 358. Date: Wed, 30 Aug 89 09:47:16 PDT From: brent (Brent Welch) Subject: mint boots sprite on homer The folks in 608-1 complained that homer was running Sprite. Indeed, mint was beating ginger to the punch and supplying it with a Sprite kernel. How do we enforce control over tftp booting? Only with the symbolic links set up in /sprite/boot? If so, we should be careful about setting up links for not-normally-sprite-hosts in /sprite/boot. For now, I'm booting homer with ie(0,961c) to force it to run UNIX. 359. Date: Wed, 30 Aug 89 10:33:55 PDT From: brent (Brent Welch) Subject: /sprite/boot up-to-date I went through /sprite/boot and removed a few symbolic links that correspond to machines no longer running sprite. This includes: homer (128.32.150.50 a.k.a. 80209632) turmeric (128.32.150.37 a.k.a. 80209625) bay (128.32.150.18 a.k.a. 80209612) tully (128.32.150.44 a.k.a. 8020962C) I also see a link for 80209C68.SUN4, which is an unused address on the 156 net (cory). This is probably for raid, but raid isn't in the host tables I see. There is also a link for 80209c95, which corresponds to ponca, except that the 'c' probably needs to be capitalized, and I don't know if tftp booting works through the gateway(s) or not. 360.
Date: Wed, 30 Aug 89 10:36:09 PDT From: Fred Douglis <douglis> Subject: Re: /sprite/boot up-to-date this is rather awkward, since we might occasionally want to boot sprite on different hosts. perhaps we could have a script like the one on ultrix to add/remove hosts automatically. 361. Date: Wed, 30 Aug 89 12:49:20 PDT From: Fred Douglis <douglis> Subject: new xdvi installed... color support, but doesn't run native on ds3100 I picked up some patches from comp.sources.x for xdvi. It now runs on a sun3 using a color ds3100 display (it already worked for B&W). However, it doesn't run native on a ds3100 -- I presume it has byte-ordering problems. I'm inclined not to fix it, since I imagine someone else will as there have been regular updates. If I broke anything else with the new install, let me know. 362. Date: Thu, 31 Aug 89 00:01:41 PDT From: Fred Douglis <douglis> Subject: Re: xgone complaint well, the default is to prompt for a password, i guess. actually, i thought i'd changed that, but maybe not. i'll check. in any case, there's an option to disable it, and you should be able to rlogin and kill it, can't you? 363. Date: Wed, 30 Aug 89 14:19:07 PDT From: Fred Douglis <douglis> Subject: another ds3100 crash (malloc) piracy died with a bogus value (0x54) in its freelist. nothing too terribly obvious, except i did notice that cardamom had recently rebooted and it was in a migration-related call for something with home node cardamom. does anyone know what cardamom was doing when it was rebooted? i wonder if something got freed too soon, or something. p.s. dave culler said that piquante was just as unstable running ultrix as running sprite: xterm would die about once/day and the kernel itself would crash periodically. 364. Date: Wed, 30 Aug 89 17:19:46 PDT From: Fred Douglis <douglis> Subject: assault hardware problem? for the record: assault died a little while ago with something called a "bus error". 
however, the address was 0xc0c019d4, and 0xc0c019d0 and d8 were perfectly valid. apparently, a bus timeout can occur on a parity error, which is a possible cause of assault's problem. 365. Date: Wed, 30 Aug 89 22:26:27 PDT From: jhh (John H. Hartman) Subject: xgone complaint Several times I have been confronted by machines running xgone that insist I type in the password for the person that started it running. It would be nice if xgone could be killed or the password feature disabled. 366. Date: Thu, 31 Aug 89 09:48:49 PDT From: ouster (John Ousterhout) Subject: Another Piracy Crash This time the message was: Fatal Error: Page number outside bounds of corePtr->virPage.page table Syncing disks Version: SPRITE VERSION 1.002 (ds3100) (20 Aug 88 18:20:10) Entering debugger with a Breakpoint trap exception at PC 0x800bc6d8 The corpse is available for debugging. 367. Date: Thu, 31 Aug 89 11:02:48 PDT From: Fred Douglis <douglis> Subject: Re: Another Piracy Crash this is similar to a crash i looked at before: Vm_Clock was trying to clean a page that didn't belong in the segment it pointed to. The segment was an inactive code segment with 17 pages (16 resident), and corePtr->virtPage referenced page 1037. so for example, if the virtPage page number had an extra bit set accidentally, it could have really meant to reference 0xd instead of 0x40d and it would be a perfectly reasonable page. Just a thought... 368. Date: Thu, 31 Aug 89 11:20:10 PDT From: brent (Brent Welch) Subject: Too many system calls The "Too many system calls" should not be a panic, I think, because the problem occurs very early during bootstrap. Can't it just print out a warning and ignore the rest of the kernel calls? 369. Date: Thu, 31 Aug 89 11:52:55 PDT From: ouster (John Ousterhout) Subject: The slows Something related to Sprite seems to have "the slows" this morning. I suspect Allspice, because that's where the files are that I'm compiling.
The symptoms are that a compile takes a VERY long time, and the status line printed afterwards shows only 10-15% utilization of the CPU. Also, I've noticed occasional RPC timeout messages about allspice. The problem has come and gone a couple of times this morning. 370. Date: Thu, 31 Aug 89 12:21:49 PDT From: Fred Douglis <douglis> Subject: ds3100 ultrix weirdness a comment on *ultrix* weirdness. (i asked david if he's seen the same thing on piquante since it switched to sprite.) ------- Forwarded Message Date: Thu, 31 Aug 89 12:15:27 -0700 From: david@fennel.berkeley.edu (David A. Wood) To: root@fennel.Berkeley.EDU Subject: ds3100 (ultrix) NFS weirdness I have been running my cache simulator on greed and piquante and occasionally get some bogus results. They are very small errors, usually an extra line or two in the input file, but it is somewhat disconcerting. Has anyone else been experiencing these problems?? --david ------- End of Forwarded Message 371. Date: Thu, 31 Aug 89 12:30:10 PDT From: Fred Douglis <douglis> Subject: page-in error can kill kernel paprika crashed yesterday with a bus error. turns out Proc_Exec made an argument array accessible, then hit a bus error referencing it. this was about the time that allspice crashed, i got an "Fs_PageRead waiting" message, and i hit ^C to interrupt the exec. looks like Vm_MakeAccessible needs to lock down the page rather than relying on the same Vm_Copy check, since an error on page in has a choice of killing the kernel or returning something that will not be passed back to the routine accessing the data. at least, a page-in error is the only thing I can think of to account for the kernel dying. Suggestions from the VM experts?? 372. Date: Thu, 31 Aug 89 12:13:37 PDT From: mgbaker (Mary Gray Baker) Subject: Re: The slows Yesterday when I went to check on allspice's slowness, messages on the console showed it had been blasted with rpc version mismatches. 
This happened, it seemed, just when assault was booted (and unbooted quickly, since it died real soon). I reset the network interface and this helped to some extent, since mint could talk to allspice again where it hadn't just before. Maybe something worse is going on. Whatever it is, it only seems to pick on certain client machines at a time. 373. Date: Thu, 31 Aug 89 12:38:52 PDT From: Fred Douglis <douglis> Subject: Re: Too many system calls I've changed *.md/machCode.c to handle inconsistencies a little better. It prints warnings for too many system calls (ignoring the extra) or too many arguments (ditto). It also prints a warning for out-of-order call initialization. Mendel says that should be a panic still, but the problem (as we've seen) is that it's too early to panic. I'm open to suggestions. I'm recompiling now. 374. Date: Thu, 31 Aug 89 13:50:32 PDT From: mendel (Mendel Rosenblum) Subject: sprintf man page incorrect The man page for sprintf says: RETURN VALUE The functions all return the number of characters printed, or -1 if an error occurred. This is incorrect. 375. Date: Thu, 31 Aug 89 17:08:03 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: rcp hangs on ds3100 Rcp of a kernel from a ds3100 to dill hangs. 376. Date: Fri, 01 Sep 89 00:14:35 PDT From: rab (Robert A. Bruce) Subject: Allspice crashed Allspice crashed with the following error: Fatal Error: Page number outside bounds of pagetable Entering debugger with a Interrupt trap (16) exception at PC 0xf6081320 Jhh tried to debug it but couldn't because it was running sun4 instead of sun4.new and the sources aren't available. The bug is repeatable by running gdb and trying to stepi through an instruction that destroys the stack pointer. 377.
Date: Thu, 31 Aug 89 20:10:27 PDT From: mgbaker (Mary Gray Baker) Subject: printenv doesn't take arguments On unix machines such as rosemary, printenv will take arguments so that you can say "printenv TERM" and get the answer "tx" rather than the answer "printenv doesn't take any arguments; "TERM .." ignored." and then your whole environment. 378. Date: Thu, 24 Aug 89 17:22:18 PDT From: brent (Brent Welch) Subject: mail with no /tmp The previous empty mail message was generated when /tmp was down. I'm not sure this is worth trying to fix. However, my mail session looked like: <sage 208> mail bugs Subject: migration offset The stream offset is probably being screwed up during migration. This can explain the problems with pmake's shell scripts getting apparently garbled. 379. Date: Thu, 24 Aug 89 17:24:40 PDT From: brent (Brent Welch) Subject: mail with no /tmp (This supersedes the previous message.) I tried to send mail while oregano was down. After I ended the mail session by typing . (on a line by itself :) I got: EOT Null message body; hope that's ok read: stale remote file handle And an empty message with no subject line was generated. brent 380. Date: Thu, 24 Aug 89 17:51:22 PDT From: brent (Brent Welch) Subject: Oregano hung for 5 minutes My consistency timeout kicked in today. The timeout period is 5 minutes in order to allow a client with a large dirty cache plenty of time for a write-back. However, 5 minutes is enough time for everyone to think there is a major problem. I almost had Oregano in the debugger when the timeout message appeared on the console and things fixed themselves up quite nicely. How about a shorter timeout? 381. Date: Thu, 24 Aug 89 21:25:49 PDT From: rab (Robert A. Bruce) Subject: mkmf I tried to re-mkmf the library directory but mkmf generated bogus makefiles. Make issues the following complaints: "Makefile", line 29: Undefined variable "$ " "/sprite/lib/pmake/biglib.mk", line 64: Undefined variable "$ " ...
"/sprite/lib/pmake/tm.mk", line 23: Undefined variable "$ " ... The offending line in the Makefile is: TM ?= $ {defTarget:q} At first I thought that there was just an extra space after the $, but when I removed it I got these messages: pmake: Unknown modifier 'q' "Makefile", line 29: Undefined variable "${defTarget:q}" pmake: Unknown modifier 'q' "/sprite/lib/pmake/biglib.mk", line 64: Undefined variable "${defTarget:q}" pmake: Unknown modifier 'q' ... 382. Date: Thu, 24 Aug 89 21:50:16 PDT From: Fred Douglis <douglis> Subject: Re: mkmf oops. there was a typo in mkmf.biglib. the extra space was in the mkmf script, not in the makefile. it's fixed now. 383. Date: Fri, 25 Aug 89 08:40:27 PDT From: ouster (John Ousterhout) Subject: Piquante won't boot David Culler has been trying unsuccessfully to boot piquante this morning. After the command "boot -f tftp()", the following messages appear: TFTP Error: 1 (file not found) TFTP Error: 1 (file not found) TFTP Error: 1 (file not found) TFTP Error: 1 (file not found) couldn't load tftp Can someone who understands ds3100's better than I do (Bob? Fred?) give David a hand in getting his machine booted again? Thanks. -John- P.S. I'm wondering if the problem is a well-intentioned Ultrix TFTP daemon responding to the broadcast before Sprite does. 384. Date: Fri, 25 Aug 89 08:51:06 PDT From: Fred Douglis <douglis> Subject: Re: Piquante won't boot I get that any time I try to boot with tftp without saying "init" to the prom beforehand. Had he tried that? 385. Date: Fri, 25 Aug 89 08:58:18 PDT From: ouster (John Ousterhout) Subject: Re: Piquante won't boot At your suggestion I tried "init", but it didn't work. I also tried power-cycling the machine, which also didn't help. 386. Date: Sun, 27 Aug 89 21:21:41 PDT From: Fred Douglis <douglis> Subject: debugging hosts did anyone have a chance to poke around murder in the debugger before rebooting it? i need to look in the debugger any time something like this happens. 
also, it would be very useful for bug reports to say not only which hosts are involved with a problem, but which kernels they are running. Having monotonically increasing version numbers is a wonderful idea because it makes it much easier to identify kernels. I noticed that Brent set up his own directory to do something similar, so I copied his Makefile setup to my own; for example, right now I'm running Kernel version: SPRITE VERSION FD.001 (ds3100) (25 Aug 89 18:34:42) 387. Date: Fri, 25 Aug 89 10:36:09 PDT From: Fred Douglis <douglis> Subject: ds3100 stuff [john, sorry for the duplication due to my typo] i noticed piracy was in the debugger and tried to debug it. however, i couldn't find out which kernel it is running, because kmsg -v doesn't work, and i misguessed. you might as well reboot. also, brent and i had trouble finding the unstripped binary corresponding to the installed ds3100 that dave culler is running. turns out someone removed it or overwrote it on sprite, but i had copied it to dill in the form "ds3100.new" a few days ago. we really need to be careful about keeping debuggable versions, especially on dill (in /sprite/src/kernel/nelson, at the moment, which is on dill's local disk). finally, are the rdists of kernel sources to unix being done automatically, finally? dill mounts /sprite3 and i have set up the debugger search path to look there. 388. Date: Fri, 25 Aug 89 11:07:19 PDT From: culler (David Culler) Subject: IO error from EMACS When ``that evel editor'' (EMACS) tries to write a file to a pseudo-file system it gets an "IO error". Apparently this arises when EMACS tries to sync the file to make sure it is written, as the write was successful. 389. Date: Fri, 25 Aug 89 11:48:56 PDT From: ouster (John Ousterhout) Subject: Kernel names in /sprite/src/kernel/sprite Perhaps all this has been fixed in the recent changes, but it used to be that each recompilation in /sprite/src/kernel/sprite moved the "current" kernel (e.g. 
sun3) to one with a date appended to its name. This is all fine, except that there was no obvious way to tell which of the many old sun3 kernels corresponded to what was installed as sun3.new, or, more importantly, sun3. Hence at one point I accidentally removed the only unstripped copy of the sun3 kernel while trying to clean up irrelevant binaries. Does the new naming scheme make it clear which unstripped kernels correspond to "official" versions? If not, it would be nice if it did. 390. Date: Fri, 25 Aug 89 12:11:52 PDT From: rab (Robert A. Bruce) Subject: Re: ds3100 stuff There is a shell script in /sprite/lib/misc/distfile.kernel to rdist the kernel sources. It isn't run from the crontab right now because there is a problem. When sprite attempts to find the size of a file on ginger it gets the wrong size, so every file is copied every time. I am not sure what the problem is. I suppose we could put it in the crontab anyway for now. Does anybody have any ideas as to why the sizes are getting screwed up? 391. Date: Sun, 27 Aug 89 21:46:44 PDT From: Fred Douglis <douglis> Subject: ds3100 getting repeated floating-point interrupt in kernel Garth commented that he crashed a couple of ds3100s (pepper and parsley) running his simulator on them. Turns out parsley was in the same state as pepper, but this time ^C followed by "run" in kdbx (not normally needed, I thought) made me able to poke around. It was in a panic due to an FP interrupt in kernel mode. This happened once before and Mike said to let him know if it happened again, I think. i'll mail the kdbx session to Mike in case it's of use. it includes mach_DebugState. 392. Date: Fri, 25 Aug 89 12:21:41 PDT From: rab (Robert A. Bruce) Subject: unkillable process The dump died last night.
When I tried to restart it I got this error: Can't open /hosts/murder/dev/exabyte.norewind: text file or pseudo-device busy The process that has it open is 9112c WAIT 2:04 tar ncfT - - This process completely ignores `kill -DEBUG' and `kill -KILL'. The process is still alive on murder if anyone wants to look at it. 393. Date: Fri, 25 Aug 89 13:25:12 PDT From: brent (Brent Welch) Subject: Re: Kernel names in /sprite/src/kernel/sprite The Makefile saves the ${TM} kernel image in ${TM}.version at the end of the script. It is easy to revert and leave the kernel in ${TM} and do the rename before you make the next version. We can vote on this at meeting. 394. Date: Fri, 25 Aug 89 14:08:43 PDT From: eklee (Edward K. Lee) Subject: gremlin I'm trying to run gremlin remotely using forgery's monitor but gremlin complains: "Couldn't open font file" I was able to run xdvi remotely. 395. Date: Fri, 25 Aug 89 15:41:14 PDT From: Fred Douglis <douglis> Subject: pmake garbling explained after looking carefully at the pmake output, we realized what was happening. the shell would read some commands, then suddenly start reading from the beginning again. We figured this had to be because of eviction. brent has found two shared-offset bugs so far, one for reading and one for writing, and i have a program that can recreate the problem, though only for full 4096-byte reads, not the smaller reads that sh does. anyway, thanks for the suggestions. my check for the file existing did catch the fact that there are occasionally leftover files in /tmp with the same processid, but as it turns out, "w+" truncates as well so that wasn't really the problem. 396. Date: Fri, 25 Aug 89 16:06:54 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: sprite rpc and gateways Right now it looks like the gateway between evans and cory is changing random words. We don't have a checksum mechanism to protect against this. Every time tonkawa boots at least one program contains an illegal instruction. 
As a result the spur cluster in Cory is unusable. I have set tonkawa up to use as much stuff off of its local disk as possible, but this is only a partial fix since some things still need to access /sprite. 397. Date: Fri, 25 Aug 89 18:03:16 PDT From: mgbaker (Mary Gray Baker) Subject: Question about file system cache I compiled a new test kernel for the sun4 in my kernel directory. In /sprite/boot I had a symbolic link to it. When I tried to reboot, tftp said there was no such file. But there was. It turns out the file system was full, although I got no write-back errors when I compiled the kernel. When I cleaned out some space elsewhere in the file system, tftp found the file. Shouldn't I see a message about write-backs not working? I probably don't understand what's going on, but I assume this all happened because the file was still in the client's fs cache. I guess there's nothing that can be done about it, but it seems a weird kind of caching to me if references to the file can't find what's cached for the file. Yeah, I know it's on a different machine, but the behavior still seems weird to me. 398. Date: Fri, 25 Aug 89 18:06:45 PDT From: Fred Douglis <douglis> Subject: Re: Question about file system cache As I just told Mary in person, the lack of a message is because the link took place on another host and the messages went to its syslog. We should figure out how to do something about this. 399. Date: Fri, 25 Aug 89 18:16:35 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: fix to #281 There are new versions of rn, inews, Rnmail, and Pnews installed that fix bug #281 (among other things). For future reference, if someone installs rn and has to run the Configure script, tell it you don't want the programs to be portable. That will cause rn to think the host name is always 'sprite.berkeley.edu'. 400. 
Date: Fri, 25 Aug 89 18:46:42 PDT From: Fred Douglis <douglis> Subject: migration signal race condition (hopefully) fixed Brent reported earlier that a script he wrote to test shared offsets would often hang. I looked into it and found the problem was primarily in the check in Sig_Pause that would cause the user-level library to repeat Sig_Pause in the event that a migration signal was pending. In fact, it should only repeat if the *only* signal pending is migration related. In addition, while I was looking at potential causes, I realized there's a race condition when sending signals to a process that's about to migrate back home. I think I fixed that too, though there may still be a tiny window of vulnerability I'll have to investigate. Fixed in the uninstalled proc & sig for ds3100. I'll compile for the other machine types now. 401. Date: Sun, 27 Aug 89 12:14:58 PDT From: ouster (John Ousterhout) Subject: Migration hangup Migration seems to have caused creeping paralysis in Mace this morning. I ran pmake, noticed that it wasn't doing anything, and also noticed the following message in my syslog window: RpcDoCall: <mig command> RPC to murder is hung Sure enough, murder seemed to be dead (no response to rlogins, for example). However, I was unable to control-C the pmake process (no response in the window where I typed control-C). I then tried "kill -KILL" on the "sh -ev" process that was hanging during migration, and that just hung the shell where I typed the kill. Finally, I typed "kmsg -d murder" in another window, at which point the following messages appeared in my syslog window, and everything cleaned itself up: <mig command> RPC exit 0x30002 <mig command> 8/27/89 12:11:44 murder (17) RPC timed-out Warning: Proc_MigrateTrap: error encountered sending encapsulated state: no Reply to an RPC request within a threshold time limit. <mig command> 8/27/89 12:11:51 murder (17) RPC timed-out At this point I continued murder and everything seems OK, at least for now. 
What I don't understand is why I had to put Murder into the debugger before migration cleaned itself up. 402. Date: Sun, 27 Aug 89 12:57:53 PDT From: ouster (John Ousterhout) Subject: gdb not killing process: repeatable? I think I know how to reproduce the problem where gdb hangs while killing a process: 1. Start up gdb on a process. Get the process running, then get back into gdb, say, via a breakpoint. 2. Recompile the program being debugged. 3. Now go to the gdb process and type "kill". The kill will hang until the process is manually killed from some other window. I've been able to make this happen repeatably (in ~ouster/mipsim). I suspect that it might be a bug in gdb: I also noticed that gdb is unhappy if you remove the executable being debugged and then try to kill from within gdb: I got the message /user1/ouster/mipsim/sun3.md/mipsim: no such file or directory. 403. Date: Sun, 27 Aug 89 14:01:26 PDT From: mgbaker (Mary Gray Baker) Subject: bug #225 has disappeared I've checked up on the bug I reported about sun3 include paths being used by default for sun4 compilations. It seems to be fixed, at least in all the test cases I could think of, so I removed the explicit sun4 include paths from the library.mk files, etc. 404. Date: Sun, 27 Aug 89 18:04:31 PDT From: gibson (Garth Gibson) Subject: mkdir error message The error message generated by "mkdir dirX" in an NFS directory where dirX already exists is not very informative: *** compat: Invalid message # for Gen module: status = 0x11 mkdir: submit: invalid argument 405. Date: Mon, 28 Aug 89 10:32:18 PDT From: brent (Brent Welch) Subject: Re: Question about file system cache If the ld of your sun4 kernel migrated to a different machine then the disk full messages probably appeared there. If you checked Oregano's console you may have seen the messages there, too. An open will fail if the last writer of the file cannot write it back. I think this is the best behavior. 
It's better to abort the open than to get bad data. I'm not sure what error code is returned in this case, and perhaps that can be fixed. Even so, I don't think too many programs expect a "disk full" error from open(). 406. Date: Mon, 28 Aug 89 10:40:44 PDT From: brent (Brent Welch) Subject: Re: mkdir error message The problem is that nfsmount is returning a UNIX error code and then the compatibility library is trying to map it from a Sprite to a UNIX code. I'll take a look at nfsmount. Eventually we'll convert back to all-UNIX error codes, but don't hold your breath. 407. Date: Mon, 28 Aug 89 11:34:19 PDT From: eklee (Edward K. Lee) Subject: screen blanking on ds3100 screen blanking does not seem to work on the ds3100. 408. Date: Mon, 28 Aug 89 11:36:41 PDT From: Fred Douglis <douglis> Subject: Re: screen blanking on ds3100 sometimes it does, sometimes it doesn't. if you're going to be gone for a while, run "xgone" to make sure you have a screensaver running. 409. Date: Mon, 28 Aug 89 11:38:55 PDT From: Fred Douglis <douglis> Subject: Re: screen blanking on ds3100 p.s. my last note was a bit terse, as i realized after i sent it. thanks for the report, and it's certainly something someone should look into at some point. i mentioned xgone as an interim solution, which means fixing the screensaver should be done but isn't as high a priority as it might otherwise be. 410. Date: Mon, 28 Aug 89 13:12:23 PDT From: Fred Douglis <douglis> Subject: another full fs bug I was trying to come up with a better test case for the pmake garbling bug (one that would demonstrate when the bug is truly fixed). I made the mistake of creating new files, with different $$ process ids, instead of reusing the same files. When /tmp filled up, and fenugreek tried to evict something writing to /tmp, the process froze and became unkillable. I wasn't aware that space was a problem, of course, since the message went to fenugreek and I was rlogin'ed. 
I went to lunch, and the problem resolved itself when space was freed. What happened here was that the fs callback took place with the process locked. I think I can fix this problem by changing migration not to keep the process locked while deencapsulating it, except for proc-related operations. 411. Date: Mon, 28 Aug 89 14:15:41 PDT From: Fred Douglis <douglis> Subject: Re: access to printer lw608-8 Ann, Printing from the decstations needs work. I've found that if I send something, it usually complains that the daemon doesn't exist, and if I then print something from a sun3 both the file(s) spooled from the ds3100 and the new file from the sun3 get printed. For the time being, I'd recommend that you rlogin to a sun3 and print from there. Also, please send mail about things on sprite not working to "bugs" rather than "root". They then get automatically filed and indexed accordingly. 412. Date: Mon, 28 Aug 89 14:53:44 PDT From: Fred Douglis <douglis> Subject: fs shared offsets race condition i just talked to brent some more about the file system migration problem. he's fixed some bugs already and will be testing the fixes on murder soon. but he just came up with another pathological case we have to deal with. consider the following sequence of events: process 1 forks process 2 with shared descriptor descriptor is at offset * processes 1 and 2 are told to migrate process 1 gets signal, starts to be encapsulated process 2 does I/O using shared descriptor, offset ** process 2 gets signal, starts to be encapsulated process 2 completes migration other host gets offset ** for descriptor process 1 completes migration other host gets [old] offset * for descriptor brent suggested that we might associate a timestamp with each encapsulation, so that an earlier offset couldn't overwrite a later one. i'm a bit worried that this might affect one symptom without curing the whole disease -- the idea of side-effect free, parallel encapsulation has me worried. 
if anyone has ideas of other pathological cases that might arise, please speak up. the design of the fs-migration interaction might warrant some discussion at an upcoming meeting. 413. Date: Mon, 28 Aug 89 15:29:51 PDT From: ouster (John Ousterhout) Subject: Re: fs shared offsets race condition It sounds to me like the problem with shared offsets is that they aren't handled at the right time during migration (i.e. there's a window of time where the offset is "neither here nor there"). If an offset is shared, or even "possibly shared", wouldn't it be better to have the server take over responsibility for the offset at the beginning of migration rather than the end? Thus Fred's scenario would look like this: process 1 forks process 2 with shared descriptor descriptor is at offset * processes 1 and 2 are told to migrate process 1 gets signal, starts to be encapsulated -- during encapsulation, offset becomes shared, so server -- takes over responsibility for it. Server's offset = * process 2 does I/O using shared descriptor, offset ** -- I/O is sent through to server, so server's offset gets -- updated to ** process 2 gets signal, starts to be encapsulated process 2 completes migration -- since offset is shared, process 2's new host doesn't -- get offset at all. process 1 completes migration -- same as note above. In the unlikely event that process 1 -- and process 2 are now on the same host again, so that -- the offset is no longer shared, the server could notify -- the client (during de-encapsulation) to cache the offset -- locally. The server would pass the client the correct -- offset to cache (**). Wouldn't this approach eliminate the window of vulnerability? I share Fred's concern that timestamps might solve one symptom while leaving other vulnerabilities; they smell tricky to me. 414. Date: Mon, 28 Aug 89 16:34:17 PDT From: Fred Douglis <douglis> Subject: pdev, device deadlocks; kgdb backtracing mace got doubly wedged today. 
first, from paprika, mig -h mace csh -c "tail -f /dev/null&" caused the tail processes on mace to become unkillable, waiting in an RPC back to paprika. i'll debug paprika's end as well, once it's available. second, mace had an rlogind process waiting for a pdev open because the pdev was marked busy. the rlogind was unkillable. any other process trying to open the /hosts/mace/rlogin1 file it got blocked on also was wedged and unkillable. finally, i couldn't find out that much on mace because kgdb backtracing broke: after switching kernel stacks and going up stack frames, kgdb got confused and "info reg" produced ERROR: invalid read address 0x0 as did any commands to print local variables. 415. Date: Mon, 28 Aug 89 18:30:56 PDT From: gibson (Garth Gibson) Subject: Not a sprite bug, but .... This is not necessarily a Sprite bug, but Spriters may need to beware. I had a file on Unix that I "mx"d on Sprite through NFS. I changed much and increased its length substantially. I had Mx write it then tried to print it on rosemary. An old version with a bunch of trash at the end was printed. Sprite and other sun unix machines see the correct file, but rosemary appears to have suffered a caching problem. I should note that rosemary had been touching the file before and during the Mx session and the other unix machines did not touch it until Mx had quit. I should also note that I almost never use Mx across NFS (vi for NFS, Mx for Sprite - everything in its place). Rosemary remains confused about the file (even after a sync). 416. Date: Mon, 28 Aug 89 19:30:42 PDT From: mgbaker (Mary Gray Baker) Subject: Re: Not a sprite bug, but .... This problem plagued me frequently while I was doing the sun4 port working from rosemary. You can fix it by moving the file to a new name (on sprite) and touching and removing the old file name (from unix) and then moving the file back to its old name (from sprite) and then reaccessing it (from unix). 417. 
Date: Fri, 1 Sep 89 12:24:40 PDT From: eklee (Edward K. Lee) Subject: clarification on gremlin bug I was running sun3.new on mustard when I discovered the following bug. While running gremlin on ~eklee/raid.cont/config.grn, doing a pan (downward in this particular instance) gremlin crashed. Not only did gremlin crash, but I lost control of the mouse as well (Mustard was still up). Panning does not always cause gremlin to crash, but after you do several pans you gradually lose functionality. The first thing to go is your snap factor. It becomes very large for some reason and you can not get it below a certain level. Next, objects are displaced haphazardly. Finally, it becomes difficult to control the direction and magnitude of panning. 418. Date: Fri, 1 Sep 89 17:21:53 PDT From: mgbaker (Mary Gray Baker) Subject: mail bug When I responded to John's mail about SYS_MAX_ARGS, using the "r" command, the mailer changed the bugs address to the bogus address sprite.berkeley:bugs@edu Below is the mailer daemon report. >From mgbaker Fri Sep 1 17:11:20 1989 Received: by sprite.Berkeley.EDU (5.59/1.29) id AA919609; Fri, 1 Sep 89 17:11:17 PDT Date: Fri, 1 Sep 89 17:11:17 PDT From: MAILER-DAEMON (Mail Delivery Subsystem) Subject: Returned mail: Host unknown Message-Id: <8909020011.AA919609@sprite.Berkeley.EDU> To: mgbaker Status: R ----- Transcript of session follows ----- 550 sprite.berkeley:bugs@edu... Host unknown ----- Unsent message follows ----- Received: by sprite.Berkeley.EDU (5.59/1.29) id AA919600; Fri, 1 Sep 89 17:11:17 PDT Date: Fri, 1 Sep 89 17:11:17 PDT From: mgbaker (Mary Gray Baker) Message-Id: <8909020011.AA919600@sprite.Berkeley.EDU> To: jhh@sprite.Berkeley.EDU, sprite.berkeley:bugs@edu Subject: Re: SYS_MAX_ARGS redefined Oops. My fault. I thought I'd privately defined that one in machConst.h and moved it to sysSysCall.h when I started using the ASM stuff. I'll fix it. 419. 
Date: Fri, 1 Sep 89 17:22:48 PDT From: brent (Brent Welch) Subject: tx killed, csh -i looped 2 bugs - I killed tx by running my error stress test for the read system call. I passed a bad pointer for the read buffer, tx got an error from the pseudo-device code, and exited. I understand how to make this better - currently the code can't tell if the pseudo-device's request buffer is bad, or the user has a bad buffer; the cross-address space copy just gets a fault and it doesn't know who caused the problem. I can fix this by adding extra code to determine what buffer is bad after the error occurs. bug 2 - the csh -i child process of the tx that panicked went into an infinite loop. I'm not sure what it was doing, but I imagine that this is repeatable. Repeat by: cd /sprite/src/benchmarks/read read -e (while running in a tx window, of course) 420. Date: Fri, 1 Sep 89 18:56:33 PDT From: douglis (Fred Douglis) Subject: kamikaze l1 key i accidentally hit l1-h instead of l1-k, or something like that. at least, the debugger said i was in the state i'd be in had i hit l1-h. unfortunately, that state was "in the debugger with a bus error exception"... looks like the routine to dump name hash stats needs to be a little more careful. this was repeatable on a sun3 after i killed a ds3100. kids, don't try this at home. 421. Date: Tue, 5 Sep 89 13:35:04 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: X server on ds3100 dies My X server frequently dies on the ds3100. It typically happens right after I start it. Sometimes my xclock window starts out huge -- the server always dies after this happens, but it also dies if it doesn't happen. 
The error message always is: X Error: request length incorrect; internal Xlib error Request Major code 74 Request Minor code ResourceID 0x200040 Error Serial #409 Current Serial #422 This is rather annoying since I have to kill and restart the ipServer each time and it takes more attempts to get the X server to stay up than I have patience for. I have been running kernel version 1.010 and some of my own kernels which use the uninstalled sources. I don't see any hope of fixing this bug since we don't have the sources, but I thought I'd get it recorded for posterity anyway. 422. Date: Tue, 5 Sep 89 15:16:54 PDT From: pmchen (Peter M. Chen) Subject: eqn differing behavior for sprite and unix One of my ditroff files prints out correctly under unix and not under sprite. The difference that I noticed is fractions have overlap between numerator and denominator (under sprite). The example file is in sprite:~pmchen/amdahl/sigmetrics/paper[12]. This should be the same as unix:~pmchen/sig/sigmetrics/paper[12]. To format the file, cd to ~pmchen/amdahl/sigmetrics (or ~pmchen/sig/sigmetrics) tbl -Ppulla paper* | grn %lw | eqn | ditroff -me %lw -h One of the example differences is on page 7, 5 text lines down from the top of the page. (N-1/N) 423. Date: Tue, 05 Sep 89 15:20:01 PDT From: Fred Douglis <douglis> Subject: problems with IP I wasn't able to log in to various unix machines -- for example, I could talk to ginger but not dill or rosemary. I then found I couldn't log into mint either, though migrating onto it showed that its ipServer was alive. It turned out there was a finger in the debugger and a bootp in an infinite loop. when i killed them off (I couldn't debug using migration), I could get arp responses and kvetching could now talk to other hosts, but I still couldn't log into mint. I then noticed that someone was in the process of running "restartservers" on mint, and that portmap was now in the debugger. I take it someone else walked over to mint to restart stuff. 
424. Date: Wed, 06 Sep 89 11:07:49 PDT From: Fred Douglis <douglis> Subject: sld bug I tried to reinstall a new pmake without the debugging files, but the spur version wouldn't link. sld complained about the -mspur flag. 425. Date: Wed, 06 Sep 89 11:50:53 PDT From: Fred Douglis <douglis> Subject: another kiss of death paprika migrated onto, and killed, three hosts in parallel. fenugreek died with a "stack format error" exception. i'm checking mace now. what's more, paprika is acting strangely -- mary tried using a tx "set termcap" menu entry and it produced garbage. I wasn't able to find out too much on fenugreek, and am inclined to file this report and leave the problem alone unless it repeats. 426. Date: Wed, 06 Sep 89 12:29:41 PDT From: Fred Douglis <douglis> Subject: migration problem resolved: floating point problem? The problem from before happened during pmakes but not during explicit migrations using mig. Also, it happened just after i installed a new sun3 pmake, though I hadn't thought about that when the problem arose. i backed out pmake. must have something to do with programs that use hardware floating point. 427. Date: Wed, 6 Sep 89 14:44:54 PDT From: pmchen (Peter M. Chen) Subject: mustard crashed hard i was compiling a program in ~pmchen/raid cc -g -o multnew multnew.c -lm Message was: Exception 34 format at 0E007314 428. Date: Wed, 06 Sep 89 14:52:53 PDT From: Fred Douglis <douglis> Subject: Re: mustard crashed hard the sun3 cc was just reinstalled last night. were you doing the cc by hand or using pmake, which might have been doing it remotely? even if pmake didn't use the hardware floating point, if cc got migrated away and then evicted, it could have crashed your machine. i see you rebooted mustard. next time this happens, please try to login elsewhere, or call, to report the bug and give people a chance to look into the crash with the debugger. it's hard to diagnose after the fact. 429. 
Date: Thu, 7 Sep 89 10:14:52 PDT From: ouster (John Ousterhout) Subject: Mail return address Mail from us is still going out with a return address of "ouster%sprite.Berkeley.EDU@ginger.Berkeley.EDU" instead of just "ouster@sprite.Berkeley.edu". Won't the shorter form work OK (I've used it from WRL, for example)? If it works, can we change sendmail to use it? 430. Date: Thu, 7 Sep 89 10:31:22 PDT From: brent (Brent Welch) Subject: redirection bug? The following sequence of commands: rdate %timeServer > /dev/null & echo `date` `sysstat -v|sed -e 's/^Kernel.*1\.0 //' -e 's/) (/ /'` >! /hosts/%host/boottime cat /hosts/%host/boottime >> /hosts/%host/boottimes Occasionally puts more into the boottime file than expected: >>>> [1] Done rdate mint.Berkeley.EDU > /dev/null Thu Sep 7 02:22:38 PDT 1989 sage SPRITE VERSION 1.010 (sun3 30 Aug 89 17:20:32) <<<< There is an extra linefeed (^M) and the job control message, as well as the date and kernel stamp generated by the echo. This may well be a bug in csh, for all I know. But the csh output regarding the job gets sucked into the standard output stream of the next command. 431. Date: Thu, 07 Sep 89 12:11:07 PDT From: Fred Douglis <douglis> Subject: migration deadlock paprika wedged last night, and it only came back to life when it panicked with a full process queue. Turns out it did an open of /user1, which waited for recovery, and then deadlocked on the process itself because the process was locked during the open. I'll change it. I'm surprised this didn't bite us before (or maybe it did and we just didn't know it). 432. Date: Thu, 7 Sep 89 13:55:45 PDT From: douglis (Fred Douglis) Subject: allspice rpc wedge allspice stopped responding to RPCs. It could ping other hosts but they couldn't ping it. When I rebooted, I got a bunch of quick messages about hosts doing recovery, which implies that the act of shutting down killed something that was locking things up. 
An rpcstat -srvr showed a bunch of wait channels plus a consistently busy channel, with thyme doing a remove. 433. Date: Thu, 7 Sep 89 16:43:34 PDT From: shirriff (Ken Shirriff) Subject: ipServer problem on mint The ipServer went into the debugger with a bus error in CallTimeoutHandler line 806. I couldn't find the source files to debug further. 434. Date: Thu, 7 Sep 89 18:32:48 PDT From: pmchen@basil.berkeley.edu (Peter M. Chen) Subject: mustard crashed Message was: Entering debugger with a Bus Error exception at PC 0xe06798c Message in the syslog window was Fsdm_DomainFetch, bad domain number <341> I called Bob about it, he's looking into it. I need to reboot soon, so I'll do that when he's done. 435. Date: Fri, 08 Sep 89 10:59:31 PDT From: Fred Douglis <douglis> Subject: need new migration version for sun3s The recent change to the machine state caused an incompatibility between kernels. I am going to change migration to pass the size of key structures, such as Mach_UserState, to catch this sort of thing in the future. In any case, we need to build new kernels with a different migration version. (I think Bob may have been testing new kernels with a different version, but when I built my kernel with the uninstalled mach the other day I didn't know to do that.) 436. Date: Fri, 8 Sep 89 13:13:31 PDT From: eklee (Edward K. Lee) Subject: pmake could not find non-local include files I generated a Makefile after specifying non-local include directories via CFLAGS += -I../sim in a local.mk file. mkmf was able to find the non-local include files but when I tried to run pmake it complained that it does not know how to make the non-local include files. The program I tried to compile is in ~eklee/raid.sim. 437. Date: Fri, 8 Sep 89 14:38:41 PDT From: shirriff (Ken Shirriff) Subject: ipServer bug The ipServer crashed on me again. The problem is timeoutList has the list pointer values (1,1) which give a seg fault. 
I suspect that memory is getting overwritten somewhere and is clobbering timeoutList, but I couldn't figure out where this was happening. 438. Date: Fri, 8 Sep 89 17:01:11 PDT From: mgbaker (Mary Gray Baker) Subject: printer problem I tried printing out some files, and when they seemed to be taking a long time I checked the queue. It said it was waiting for paprika to come up. Paprika had been in the debugger for a long time, so I rebooted it. When paprika came up, nothing printed and the queue said it was empty. 439. Date: Fri, 8 Sep 89 17:37:44 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: sethostid broken sethostid is an ultrix binary that is used by the ds3100's during boot. Assault will not boot properly because sethostid dies with a bus error. I was trying to boot ds3100.new. Sethostid works on hijack, so there is something special about assault. 440. Date: Mon, 18 Sep 89 12:18:05 PDT From: Fred Douglis <douglis> Subject: ds3100 pagein at interrupt level Well, I've hit a new bug for the ds3100, though it could explain other problems; who knows? kvetching died with an "interrupt" exception. Its pc was in Mach_EnableIntr at the point where it returns after enabling interrupts. 
It was in an RPC page read at the time, with a backtrace going all the way up to: 20 Vm_PageIn(virtAddr = 0x10005000, protFault = 0) ["vmPage.c":1523, 0x800cd114] 21 .block544 ["jhh.md/vmPmax.c":1465, 0x800d1dec] 22 VmMach_TLBFault(virtAddr = 0x10005000) ["jhh.md/vmPmax.c":1465, 0x800d1dec] 23 .block13 ["jhh.md/machCode.c":1022, 0x80034644] 24 MachKernelExceptionHandler(statusReg = 64560, causeReg = 805314572, badVaddr = 0x10005000, pc = 0x800d2ffc = "") ["jhh.md/machCode.c":1022, 0x80034644] 25 Mach_KernGenException(0x800fa298, 0x34, 0xc0109234, 0x2, 0xc054ff54) ["jhh.md/machAsm.s":506, 0x80032854] 26 Vm_MachDumpTLB(0x800fa298, 0x34, 0xc0109234, 0x2, 0xc054ff54) ["jhh.md/vmPmaxAsm.s":719, 0x800d2ff8] IdleLoop looks like it was trying to panic, because the interrupt nesting wasn't 0, but the check I put in Interrupt beat it to it. Unfortunately, it never made it to the screen (maybe got buffered for my syslog instead), so I didn't know what was going on -- Interrupt used printf instead of panic. Anyway, how do we keep the whole page in from being done at interrupt level? Should it be done in the first place if it's because of a TLB flush? 441. Date: Mon, 18 Sep 89 22:31:44 PDT From: mgbaker (Mary Gray Baker) Subject: rawstat in the debugger Many invocations of the program rawstat seem to pile up in the debugger on anise, and tonight when dumping the process table on mint, I noticed rawstat was in the debugger there as well. 442. Date: Tue, 19 Sep 89 11:35:11 PDT From: douglis@rosemary.Berkeley.EDU (Fred Douglis) Subject: mint rpc wedge mint wedged during recovery again. This time I got into the debugger and poked around. I found a ton of rpc servers waiting on their BUSY flag and the rpc daemon doing an RpcProbe to sage. Is it possible that the daemon is ignoring other RPCs while its RpcProbe is taking place, or something? Anyway, when I wasn't able to find a definite cause of the problem, I continued mint, and this time people seemed to recover okay. 
(an aside: assault was shutdown during the interim, so if there's anything going on relating to the number of hosts recovering simultaneously, that might be relevant.) also, sage must be wedged itself. it recovered with mint only when i typed at its console, and even then, i wasn't able to start "ping" when i tried to ping mint from sage. nor does sage respond to pings, or let me ^C out of the ping. I'm debugging it now. 443. Date: Tue, 19 Sep 89 12:45:24 PDT From: Fred Douglis <douglis> Subject: cc bug: rpn won't compile I installed a new rpn, with a patch from Andy that fixes the hex display problem for large numbers. However, when I tried to recompile for the sun3 to make sure I didn't break anything, I found that cc hits a bus error trying to compile src/main.c. Any cc guru care to take a look? 444. Date: Tue, 19 Sep 89 14:10:22 PDT From: Fred Douglis <douglis> Subject: disk library won't compile c/disk no longer compiles -- complains that kernel/fsDisk.h no longer exists. I looked for kernel/fs*Disk but couldn't find a renamed version. What gives? 445. Date: Tue, 19 Sep 89 15:36:02 PDT From: brent (Brent Welch) Subject: Re: fs header file changes fsDisk.h is now fsdm.h. I recently moved all the old versions of fs header files in Include to a different place so old code that should be fixed won't compile. 446. Date: Tue, 19 Sep 89 15:57:15 PDT From: pmchen (Peter M. Chen) Subject: latest crash on raid Fatal Error: VmMach_DMAAlloc: unable to satisfy request for 65536 bytes at 0xf655c8b8 This was when 8 processes each requested 64KB. The kernel was sun4.md/mgbaker ~pmchen/raid/mult/ex1 /dev/rsvj1 1 (I was in ~pmchen/raid/mult) 447. Date: Tue, 19 Sep 89 15:54:46 PDT From: shirriff (Ken Shirriff) Subject: rcp from decstation hangs I tried to copy a kernel from pride (decstation) to dill and the first time it stopped after copying 106496 bytes and the second time it stopped after copying 270336 bytes. 
By stopping I mean the rcp command sat there for several minutes and then gave rcp: lost connection. I tried the copy from nutmeg (sun3) and all 1929348 bytes were copied without problem. 448. Date: Tue, 19 Sep 89 18:18:39 PDT From: brent (Brent Welch) Subject: MACH_EXC_BUS_ERR_LD_ST panic Apathy crashed on Garth with a panic from MachUserExceptionHandler. It got a fault 'cause' of MACH_EXC_BUS_ERR_LD_ST, and panic'd with a message: "User bus error on ld or st". Why is this a panic? 449. Date: Wed, 20 Sep 89 10:17:34 PDT From: gibson (Garth Gibson) Subject: what are these "LE ethernet: Received packet with CRC error." messages? I've seen them shortly after starting X11 on apathy and just now shortly after login to pepper (both ds3100s). pepper runs FD.029 (CLEANds3100) (19 Sep 89) 450. Date: Wed, 20 Sep 89 15:59:24 PDT From: Fred Douglis <douglis> Subject: mkmf change and bug fix The implementation of mkmf was different from the documentation. The documentation claims that if ./mkmf.local exists, it will be used, but the program actually looked for ./mkmf -- which is a mistake since if someone has "." in the path before /sprite/cmds, they'll invoke the script without the proper environment variables. I've changed mkmf. If anyone was relying on the broken behavior, and had a "mkmf" script instead of "mkmf.local", they should change it. 451. Date: Thu, 21 Sep 89 12:18:19 PDT From: pmchen (Peter M. Chen) Subject: raid crash I crashed raid by running 16 concurrent processes, each asking for 512 bytes. Actually, I think only 6 of them got started running. Only 6 * 512 bytes should easily fit in the DVMA space, yes? Nothing came out on the /dev/syslog, and I'm not at the console to look, but I'll ask Ken (or whoever) to look at the console when he gets in and send you the message. Ed remembered that the requests are aligned on some large boundary (128K?) to avoid some of the cache flushing problems. What happens if alignment is not possible? 452. 
Date: Wed, 20 Sep 89 16:31:17 PDT From: Fred Douglis <douglis> Subject: mkmf bigcmdtop bug If you say mkmf at the top level before running mkmf in the subdirectories, it tries to make depend and complains that */Makefile doesn't exist. 453. Date: Wed, 20 Sep 89 16:46:56 PDT From: brent (Brent Welch) Subject: hung gdb There is a hung gdb process on sage. I quit gdb while the program was at a breakpoint. The program was not continued by gdb, and it hung. I'm leaving it in the current state, and I'm even willing to let someone debug sage (ask first!) if they need to. I was able to suspend gdb and put it into the background, and I could probably kill it too. However, it shouldn't behave this way so it would be great if someone took a look at it. 454. Date: Thu, 21 Sep 89 09:46:17 PDT From: ouster (John Ousterhout) Subject: Another trashed file The file ~ouster/162/notes/t05 has become corrupted sometime between January 6 and today: the end of the file is a bunch of control characters (perhaps some machine code?) preceded by the following characters: openOpen file %s lseekreadError 0x%x from Proc_SetPriority seek time %4d.%-03d seek and read to 0x%x time %4d.%-03d I moved this file to /user1/trashed. 455. Date: Thu, 21 Sep 89 11:02:51 PDT From: Fred Douglis <douglis> Subject: cc1.68k optimization bug cc1.68k goes into the debugger trying to compile /a/newcmds/ixgraph/src/xgraph.c. This file compiles okay on the ds3100 and also compiles okay when optimization is disabled. 456. Date: Thu, 21 Sep 89 12:08:01 PDT From: brent (Brent Welch) Subject: recovery trashed file I caught a file getting corrupted after recovery. I was generating data to a file when oregano crashed. The last block ended up having data from a temporary .s file. There was 2640 bytes in the 4th block, and they were all from the wrong file. I suspect that the output file was caught in the middle of growing a fragment (from 2K to 3K) and the cache didn't get written out properly when Oregano crashed. 
I'm pretty sure the file was not being cached on the clients because I was generating it at sloth and I had just looked at it on sage. I'll go scan the cache code to see if UpgradeFragment is vulnerable. brent ps. Oregano crashed with the known bug in (sun3) 1.022 hmm... the bug only happens when the cache is full too. the plot thickens. 457. Date: Thu, 21 Sep 89 12:11:11 PDT From: pmchen (Peter M. Chen) Subject: oregano crash--netroute After the oregano crash this morning, I had to manually run netroute -s -f /etc/spritehosts in order to have raid know about oregano. Can this be put in oregano's bootup script? 458. Date: Thu, 21 Sep 89 13:47:14 PDT From: brent (Brent Welch) Subject: Re: recovery trashed file I'm pretty sure my hunch is right. UpgradeFragment is in charge of finding a larger fragment for a cache block that is growing in size from 1K to 2K, 2K to 3K, etc. It does this by fetching the cache block containing the previous version of the fragment (allocation happens before the write), changing the file descriptor's indexing structure, and then unlocking the cache block while assigning it to the new disk location. The order of these last two steps is wrong, I think, especially because the operation that shifts the cache block's disk address puts it on the dirty list, but it might wait if the old version of the block is undergoing I/O. Thus, the scenario of Oregano's crash (due to a stupid coding mistake of mine that only showed up when the cache was full...) is that the file descriptor was modified to refer to the new location, but the cache block was held up, and it never got onto the dirty list (again) with a new disk address associated with it. Et voila, when Oregano rebooted the file descriptor referenced the wrong fragment. I've simply re-ordered the operations in UpgradeFragment so it unlocks the cache block first, and then updates the file descriptor. 
Thus the worst case is that the cache block gets successfully re-assigned to a new block, but the file descriptor doesn't get updated. Oh, it is already true that the old fragment is free'd at the very end, and that seems ok. The fix for this is in fsdm, and I've got a new sun3.md/brent kernel that has this fix, plus a fix in fscache that caused Oregano to crash in the first place. All hosts that run the newly installed .new kernel are vulnerable to the Bus Error causing bug that is now fixed in fscache. I'll probably make a new .new kernel with the fix. 459. Date: Thu, 21 Sep 89 19:53:12 PDT From: Fred Douglis <douglis> Subject: Re: MACH_EXC_BUS_ERR_LD_ST panic Apathy crashed on Garth with a panic from MachUserExceptionHandler. It got a fault 'cause' of MACH_EXC_BUS_ERR_LD_ST, and panic'd with a message: "User bus error on ld or st". Why is this a panic? The uninstalled mach now kills the user process instead, since I couldn't see any reason for the panic either. Don't think this has made it into a new kernel yet, though. 460. Date: Thu, 21 Sep 89 18:02:24 PDT From: arc%sgi.sgi.com@sgi.sgi.com (Andrew Cherenson) Subject: rcsinfo/rcstell missing? On allspice, rcsinfo & rcstell are missing from /sprite/cmds. 461. Date: Sat, 23 Sep 89 23:57:33 PDT From: tve (Thorsten von Eicken) Subject: help: allspice:/mic seems pretty corrupted I get directories which contain pieces of files and an fscheck (I did: fscheck -dev rsd10 -part c) shows tons of "File nnnnn contains duplicate block nnnnn.". HELP! Can someone see what's bad? I suppose the disk will have to be reinitialized... please try to keep "/mic/tve" (except for /mic/tve/src/ftp, which is also corrupted...) Thanks, Thorsten NB: is there a way to thoroughly test the disk? 462. Date: Sat, 23 Sep 89 17:26:55 PDT From: douglis@rosemary.Berkeley.EDU (Fred Douglis) Subject: mint crash Mint died in FslclLookup because a handle wasn't locked. I am logged in from home so I can't debug too well (no scrollbars :)... 
but I did see that the name it was trying to open was "./../" if that means anything. Do we have kgcore on unix? Might be nice to have if not.... 463. Date: Sun, 24 Sep 89 19:10:46 PDT From: mgbaker (Mary Gray Baker) Subject: Re: help: allspice:/mic seems pretty corrupted This is the same problem we saw before when Martha Zimet first copied a bunch of new files onto allspice. Are the files actually corrupted, or did you just get a ton of messages from fscheck? If I remember correctly, last time no action was necessary because the files weren't actually corrupted. Some count just wasn't correct and fscheck thought things were unhappy. 464. Date: Mon, 25 Sep 89 08:26:37 PDT From: ouster (John Ousterhout) Subject: Re: /dev/tty bug (was Re: anonymous ftp problem) Removing a bogus /dev/tty is good for now, but I suspect that it's there because there's a program around somewhere that opens /dev/tty in create mode. If this hunch is right, /dev/tty is going to keep re-appearing until we find the program and change it not to create /dev/tty. 465. Date: Mon, 25 Sep 89 09:08:27 PDT From: rab (Robert A. Bruce) Subject: piquante Piquante is in the debugger with a coprocessor unusable exception. MachKernelExceptionHandler: Coprocessor unusable Entering debugger with a Coprocessor unusable exception at PC 0x800c108c 466. Date: Mon, 25 Sep 89 14:25:09 PDT From: ouster (John Ousterhout) Subject: Sendmail died Sendmail went into the debugger on Mace. Anyone interested in looking at it? I'm leaving the corpse around. 467. Date: Mon, 25 Sep 89 17:05:32 PDT From: Fred Douglis <douglis> Subject: update for ds3100 ds3100 update is an old binary. it won't compile under sprite (the installed version must have come from WRL), and it doesn't work running on a ds3100 for ds3100-based files. "update ~brent/postrawstats ~/..." created a directory but didn't copy any files. running on a sun3 worked fine. the problems relate to N_TXTOFF and similar incompatibilities. 468. 
Date: Mon, 25 Sep 89 20:07:36 PDT From: mgbaker (Mary Gray Baker) Subject: something funny with recovery I came back from aerobics once again to find the machines in a strange state. It appeared that allspice had been rebooted twice. The first time, fenugreek went through recovery. Then, according to fenugreek's syslog, there was a hung rpc echo to allspice. Then allspice rebooted again, but fenugreek didn't get recovery. I went up to allspice, and it thought it was quite happy. I rpc ping'd fenugreek and some other machines, and they responded. After about 5 minutes, and a few ls's and such, all of a sudden a whole bunch of machines went through recovery, including fenugreek. But fenugreek's window system was still frozen. I finally rebooted fenugreek with the new kernel. 469. Date: Tue, 26 Sep 89 07:12:45 PDT From: brent (Brent Welch) Subject: Re: something funny with recovery Two things. First, Allspice crashed with a "non-aligned" read. It printed a message about a 1024 byte read at about 16K and then hung. This happened while I was rebooting mint with the new .new kernel last night. Being in a hurry I just tried to reboot allspice, and then realized I hadn't installed dev, so it didn't see its new disk. I left allspice in single-user mode while I installed dev using mint. I then rebooted allspice. Anyway, that fenugreek didn't recover correctly is still a bug. There have been a few cases recently where machines don't seem to be pinging a server, so I'll look into it. Most machines seemed to recover ok. It was rather stressful on the system because I rebooted assault, then mint, then allspice. Perhaps a pagefault was waiting on recovery and somehow blocked enough things to prevent pinging. If both page faults and pinging are handled with Proc_CallFunc(), then this may be the problem. The proc_ServerProc's may be all used up waiting to page something in. 470. 
Date: Tue, 26 Sep 89 10:11:23 PDT From: Fred Douglis <douglis> Subject: allspice chucked my files Something very strange happened yesterday. My directory apparently got reverted to an older version. I had done an update from one directory into another, on /user1. I edited in that directory for an hour or two and then left. Allspice rebooted various times. This morning, my files were all as they were before I'd edited them, and the backup copies created by emacs didn't exist. My interpretation of this is that the directory was somehow reverted, so the inodes in the directory that pointed to backup versions were still valid under their original names. The one file I'd edited on two different machines was intact, however. That is, I did a lot of work on kvetching, then some work in the same directory a moment later on paprika, then eventually back to kvetching. I recommend that people look through /user1/lost+found to make sure nothing of theirs is missing. 471. Date: Tue, 26 Sep 89 11:08:37 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: update won't compile Update will not compile for the ds3100's. The existing update does not work right when run on a ds3100 when the files are on a ds3100 file server. I think the problems are due to Mike's changes to the a.out.h macros (N_TXTOFF and others). I will add this fix to my queue but want it recorded in case I forget. 472. Date: Tue, 26 Sep 89 12:07:28 PDT From: Fred Douglis <douglis> Subject: vm recovery problem i've started to get a clue about why various machines are wedging after allspice reboots. paprika was also wedged this morning. when i debugged it, i found a lot of processes waiting on the vm monitor lock, but the lock wasn't held. i poked around but couldn't find an explanation, so i finally continued the machine. 
surprisingly, it came out of its stupor, but only enough to complain about I/O errors in Fs_Dispatch, failed recovery with allspice, and finally a negative reference count on closing the swap file for one of the processes that bought it on a page-in error. 473. Date: Wed, 27 Sep 89 15:30:43 PDT From: Fred Douglis <douglis> Subject: restarting system calls from migration The migration database got locked again, and this time I was able to poke around while it was still locked. Turns out what's happening is a result of a change I made a few weeks ago to try to make migration transparent. Just as Fs_Read is really a C routine that makes a system call in a loop in case of interrupts, Fs_IOControl was changed to do this as well. That's because there were programs that would die because they got migrated during an ioctl and they got back an EINTR result they weren't expecting. On the other hand, it turns out that retrying ioctls that one would normally expect to abort (because of a real signal rather than a migration pseudo-signal) causes problems. For example, loadavg ends up retrying a blocking flock even after its alarm goes off, thereby sleeping forever. So, what to do? I suppose there's no easy way for user-level routines to find out what signal caused a system call to abort. I could special case migration by returning a different return status, which would be a pain, or I could add a system call to determine the last signal delivered, which would also be a pain. Any better ideas? This sort of issue has come up before, with respect to things like sigpause (a process blocks everything and thinks nothing that it can live through can cause it to get signalled -- KILL & such would blow it away -- and then migration causes it to get signalled). In that case, I could see what signal was pending for it and return a GEN_ABORTED_BY_SIGNAL when the only signal was migration-related; then the user-level routine would know to try again. 
A more general solution would certainly be preferable. 474. Date: Tue, 26 Sep 89 18:23:44 PDT From: brent (Brent Welch) Subject: Blocks => sector mapping broken The mapping from blocks to sectors is broken with the -scsi option to fscheck. It turns out that the mapping from file system blocks to disk sectors assumes that "rotational sets" completely take up a whole number of tracks. With the -scsi option to fscheck this isn't true, so the calculation of the 'firstSector' variable in the DiskBlock I/O routines is broken. We were just lucky with the other disks, and we weren't lucky with this one. With different geometries the bug will either overlap the rotational sets or it will separate them by some sectors. Obviously we are overlapping them in this case. (rotational sets are groupings of blocks where each block has a different rotational offset. The idea is/was to pack blocks onto sectors and get a skewed location between blocks on different tracks, sort of like a brick wall where the ends of bricks on different layers don't line up. Each cylinder is divided into a number of rotational sets.) Here is the (broken) mapping: firstSector = geoPtr->sectorsPerTrack * geoPtr->numHeads * cylinder + /* wrong */ geoPtr->sectorsPerTrack * geoPtr->tracksPerRotSet * rotationalSet + geoPtr->blockOffset[blockNumber]; I'm not sure of the best way to fix this. Adding a sectorsPerRotSet to the Fs_Geometry structure would be best. However, this will be painful because the Fs_Geometry is written on the disk. We could write a utility that munges our headers to conform to a new Fs_Geometry structure, but that sounds rather exciting. Alternatively we could pitch the notion of rotational sets altogether, but again we have the problem of all our current disks built on the old mapping. Another approach would be to detect this situation and use a different mapping. 
The bad situation occurs when geoPtr->sectorsPerTrack * geoPtr->tracksPerRotSet < DISK_SECTORS_PER_BLOCK * geoPtr->blocksPerRotSet For example, the Wren IV disks on Oregano have: sectorsPerTrack 46 blocksPerRotSet 17 tracksPerRotSet 3 tracksPerCyl 9 each RotSet is allocated 46 * 3 sectors, or 138, and 17 blocks takes up 8 * 17 sectors, or 136. So there are two (wasted) sectors after each rotational set, 6 wasted in all However, with the Wren VI disk on Allspice: sectorsPerTrack 53 blocksPerRotSet 11 tracksPerRotSet 1 (!!!) tracksPerCyl 15 each RotSet is allocated 53 * 1 sectors, or 53 but 11 blocks takes up 88 sectors.... Finally, if the -noscsi option to fsmake is specified then my original logic will correctly fit the rotational sets onto whole numbers of tracks, but there might be more wasted sectors. I can't log into Allspice to see how much would be wasted because its login is in the debugger, and all attempts to rlogin suffer the same fate. 475. Date: Tue, 26 Sep 89 21:21:43 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: gdb on sun4 broken I was running gdb on allspice and was unable to step after I hit a breakpoint. I was running version 2.7? (is this needed anymore?) of gdb. 476. Date: Wed, 27 Sep 89 12:05:00 PDT From: pmchen (Peter M. Chen) Subject: error from ls mustard% ls *** compat: Cannot decode user status value 0xffffffff 262/ cmds/ leslie/ raid/ tmps 80col cmds.sun3/ library/ reminders tt* News/ conferences/ mail/ simul/ verses/ amdahl/ dead.article me@ spritereport viv/ bin/ dead.letter misc/ talks/ writeups/ c3/ donna/ notes/ tapes/ xtroff/ calendar info/ perf/ tmp/ I also had problems logging in from envy last night (and it didn't respond to pings). If you want to look at it, feel free (I'm going to be gone 'til 1:00pm). I'm going to reboot it at 1pm, though. 
Here's a look at the syslog: Broadcasting for server of "/sprite/src/kernel" RPC srvr 62c2c RPC srvr 62c2e Broadcasting for server of "/user2" RPC srvr 92c32 Broadcasting for server of "/spur2" RpcDoCall: <stat> RPC to oregano is hung <getIOAttr> 9/26/89 20:13:27 lust (1) RPC timed-out Fsrmt_GetIOAttr failed <30002>: device <0,0> at server 1 9/26/89 22:36:13 anise (49) rebooted <stat> RPC exit 0xffffffff Broadcasting for server of "/sprite2" 9/27/89 10:39:08 lust (1) rebooted 9/27/89 10:40:43 anise (49) rebooted 9/27/89 11:23:49 kvetching (2) rebooted 9/27/89 11:34:55 lust (1) rebooted 477. Date: Wed, 27 Sep 89 12:10:10 PDT From: Fred Douglis <douglis> Subject: Re: error from ls the hung rpc was because oregano's ipServer went into the debugger. I killed it and restarted oregano's daemons late last night. When it came back, things weren't quite right: I didn't recover /sprite2, and I got -1 status values (0xffffffff) for the things in progress at the time I killed the ipServer. I then killed and restarted the mount of /sprite2 by hand and things worked better. I didn't file a bug report on this because it seemed like the same problem Thorsten had recently when he couldn't reach an NFS disk, though perhaps this is a different problem after all. 478. Date: Wed, 27 Sep 89 16:25:40 PDT From: Fred Douglis <douglis> Subject: exec bug: trashing memory Thorsten repeatedly crashed his machine by accidentally invoking a shell script that called itself recursively with more args every time. This is on my to-do list, but I wanted to file the bug report to make sure I don't lose it and that no one else wastes time tracking down the bug. 479. Date: Wed, 27 Sep 89 17:31:44 PDT From: pmchen (Peter M. Chen) Subject: mail screwed up I'm having trouble mailing things out (they go out with a null message body). The message I was going to send was about pmake errors. 480. Date: Wed, 27 Sep 89 17:32:30 PDT From: pmchen (Peter M. 
Chen) Subject: rest of message FsrmtDeviceMigrate, server error <40012> Warning: ProcMigReceiveProcess: error returned by deencapsulation procedure Fs_DeencapFileState: the file handle is out of date. FsrmtDeviceMigrate, server error <40012> Warning: ProcMigReceiveProcess: error returned by deencapsulation procedure Fs_DeencapFileState: >> are some of the error messages I got. Also, make seems to be hanging. A couple hours ago, make didn't return at all (no error messages). Now, it gives the following errors: "/sprite/lib/pmake/command.mk", line 383: Warning: Malformed conditional (!empty(DISTDIR)) "Makefile", line 33: #if-less #else "/sprite/lib/pmake/command.mk", line 214: Warning: Extra command line for "MAKECMD" ignored "/sprite/lib/pmake/command.mk", line 215: Warning: Extra command line for "MAKECMD" ignored "/sprite/lib/pmake/command.mk", line 392: #if-less #endif 481. Date: Wed, 27 Sep 89 18:47:59 PDT From: shirriff (Ken Shirriff) Subject: mint ipServer hangs / gdb is useless The ipServer on mint went into the debugger again. The stack trace is status.go CvtFtoA( bunch of junk ) Mem_PrintStatsInt I tried to debug Mem_PrintStatsInt, but every time I tried to examine the variable "i", gdb went into the debugger, so I gave up. If anyone wants more details, it's on the console. 482. Date: Thu, 28 Sep 89 10:44:36 PDT From: douglis (Fred Douglis) Subject: ds3100 bug: mem_free kvetching crashed hard with a Mem_Free storage block already free -- wouldn't respond to the debugger though it said it entered it okay. if anyone else sees this please let me know. 483. Date: Thu, 28 Sep 89 11:43:58 PDT From: Fred Douglis <douglis> Subject: ds3100 X status I couldn't find anything that has changed in the past day or so, but nevertheless, X is suddenly broken. However, /ultrix/cmds/Xcfb.new works for me though Xcfb does not. 
Furthermore, its fonts are set up ok for the DEC fonts, though not for the MIT-compatible fonts (which are in their own directory with a different fonts.dir file that is compatible with the old format). Also, Xcfb still isn't giving me color. 484. Date: Thu, 28 Sep 89 12:25:12 PDT From: gibson (Garth Gibson) Subject: "ar" across NFS on basil (SPRITE VERSION 1.010 (sun3) (30 Aug 89 17:20:32)) in /spur/gibson/Csim I execute "ar q sun3.md/csim.a sun3.md/*.o" and it seems to hang (or at least make no real progress) for minutes if instead I do "ar q ~/csim.a sun3.md/*.o" it works nearly instantly why should ar hang when the object is across NFS ? Actually, I think it is the "q" argument (quickly append). If instead I do "ar r sun3.md/csim.a sun3.md/*.o" it runs in about 15 seconds even over NFS 485. Date: Thu, 28 Sep 89 13:08:14 PDT From: douglis (Fred Douglis) Subject: xkill kills X?fb.new I used xkill and got a segmentation violation in Xcfb.new. Since we don't have sources, I don't think there's much I can do. Whoopie! 486. Date: Fri, 29 Sep 89 10:46:29 PDT From: ouster (John Ousterhout) Subject: Out of space? I'm getting the following message in my syslog window, over and over: 9/29/89 10:45:40 allspice (14) RmtFile "mbox" <2,64776> Write-back failed: out of disk space But when I do a "df" there appears to be plenty of space on /user1. 487. Date: Fri, 29 Sep 89 11:12:54 PDT From: Fred Douglis <douglis> Subject: Re: wall i reported a bug a few weeks ago that there are hung rlogind processes that cause opens of /hosts/*/rlogin* to sometimes get hung. the wall process never gets past the open. the file system has to handle hung pdevs a little better, i guess. i think as a temporary measure i will change wall to do all the syslogs first, then go back and do the rlogin pdev files afterwards. maybe eventually it can fork a child that may or may not finish and time out, but it would be better to fix the problem in the kernel instead. 488. 
Date: Fri, 29 Sep 89 15:43:25 PDT From: tve (Thorsten von Eicken) Subject: /sprite/* aren't group sprite... It would certainly help if they were... 489. Date: Fri, 29 Sep 89 16:51:32 PDT From: douglis@ginger.berkeley.edu (Fred Douglis) Subject: mint deadlock after allspice wedged and was rebooted, it was mint's turn. no one could log in because access to /sprite/admin/lastLog was hung due to cache consistency. a single process was actually in the middle of an rpc to parsley, but parsley wasn't usable. parsley responded to pings. seems the timeout for client cache consistency didn't kick in, or something. brent: what happens if a client just decides to hang the call to start the consistency? I presume the timeout only starts once the rpc has finished and you're awaiting a callback from the client. parsley is in the debugger and i'll try to poke around once mint comes back, assuming i can login to my own machine successfully for a change. 490. Date: Fri, 29 Sep 89 17:44:02 PDT From: Fred Douglis <douglis> Subject: more on cache callback problems assault ran into the same problem -- it locked up ~douglis/.emacs. i debugged it and found it was in the middle of an rpc to hijack. hijack was actually not responding to rpc pings, and ken said it was continually printing the same statement to its syslog (this bug goes way back, eh?). when hijack rebooted and assault was continued, things got back to normal. 491. Date: Sat, 30 Sep 89 01:18:11 PDT From: tve (Thorsten von Eicken) Subject: makedepend -p not used in mkmf Why doesn't mkmf use the "-p" flag of makedepend? I run into trouble with that when I run pmake: it complains "Can't figure out how to make foo.h". The right -Idirectory flag is passed to makedepend. 492. Date: Sat, 30 Sep 89 01:36:34 PDT From: tve (Thorsten von Eicken) Subject: mkmf and #define no_install there should be a note in the man page about the possibility of #defining no_install in local.mk 493. 
Date: Sat, 30 Sep 89 01:42:23 PDT From: tve (Thorsten von Eicken) Subject: mkmf and makedepend, where does DEPFLAGS go? in /sprite/lib/pmake/command.mk it says at the beginning: # DEPFLAGS additional flags to pass to makedepend but these do not appear where makedepend is actually called. Maybe I'm blind (the whole mkmf stuff is pretty complicated...) but I'll try to fix it. I'll leave comments with the string "TvE" around so someone please check whether I goofed. Thanks, -TvE NB: anyway, I think the "-p" flag should always be passed to makedepend, I'll try to do that with "DEPFLAGS=-p"... 494. Date: Sat, 30 Sep 89 15:27:23 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: wall bug I rlogin'd to sage just a few moments ago and got the following wall from yesterday: sage<jhh 2> Broadcast message from douglis@kvetching.Berkeley.EDU at 17:30 ... time for assault to be debugged. /user2 will be unavailable temporarily.... Fred x29669 495. Date: Sat, 30 Sep 89 18:18:24 PDT From: mgbaker (Mary Gray Baker) Subject: non-existent FsStats referenced in spritemon Spritemon no longer compiles because it references a structure called FsStats which doesn't exist. Was this renamed in the file system renaming? 496. Date: Sun, 1 Oct 89 13:55:14 PDT From: ouster (John Ousterhout) Subject: Re: makedepend problems It's fine to use DEPFLAGS in makedepend calls, and it sounds like a bug that it wasn't there before. However, it sounds like Thorsten may not have done everything necessary to add the usage of DEPFLAGS. If DEPFLAGS are used, then they should default to empty to handle the normal case where they're not specified. In command.mk there is a group of lines that do this for other flags, like XCFLAGS, LINTFLAGS, and so on. Perhaps the best solution is to add DEPFLAGS back into command.mk, but also add a line DEPFLAGS ?= in the group of lines just after the "#include <tm.mk>" line. 497. 
Date: Sun, 1 Oct 89 15:05:39 PDT From: ouster (John Ousterhout) Subject: Weird /mic behavior I noticed strange behavior with respect to /mic today... I'm not sure whether this is a bug or not. Mace has an old entry in its prefix table from last week when /mic existed on Allspice. At present, /mic is dismounted and unavailable (and Allspice has rebooted in there at some point too). I tried to cd to /mic, and saw two unusual things: 1. The following messages appeared in my syslog window: open of "/mic" waiting for recovery 10/1/89 14:49:13 allspice (14) RmtFile "/mic" <3,2> : stale handle 10/1/89 14:49:13 allspice (14) - recovering handles 10/1/89 14:49:13 allspice (14) RmtFile "/mic" <3,2> Reopen failed : domain unavailable 10/1/89 14:49:14 allspice (14) Recovery complete 140 handles reopened 10 failed reopens 2. The csh hung, and I had to kill it. Perhaps it makes sense for the csh to hang, since it's ostensibly waiting for /mic to become available, but I don't see why recovery should get invoked. This was repeatable: each time I tried to cd to /mic, recovery was invoked. Then I tried "ls /mic", and something different appeared in my syslog window: Fsprefix_HandleClose nuking "/mic" Broadcasting for server of "/mic" <prefix> 10/1/89 15:00:48 broadcast (0) RPC timed-out Now this seemed much more reasonable: the ls eventually quit with an error "/mic unreadable". At this point, "cd /mic" produced the same behavior, so apparently the ls unwedged something inside the kernel. Does "cd" behave differently than reading a file, and perhaps not invoke the right level of recovery actions? -John- 498. Date: Sun, 1 Oct 89 16:24:16 PDT From: ouster (John Ousterhout) Subject: Re: /sprite/lib/include/command.mk clears .PATH.h I forget the exact reason why the system .mk files clear .PATH.h, but I'm pretty sure it's necessary. I believe that it has to be done to guarantee a particular ordering of the include files, but it's been a long time since I've thought about this. 
You're right that it makes things tricky for local.mk files.... sigh. Some things in the local.mk have to be done BEFORE including the SYSMAKEFILE, and some things (like adding to .PATH.h) have to be done afterwards. It would probably be better to re-arrange the Makefiles some day so everything happens either before or after including the SYSMAKEFILE. As you've noticed, many of the Makefile features also aren't documented very well (they've gradually accreted over time). I wish there were a simpler way for all of this, (but given the complicated set of things we want the Makefiles to handle, I'm not sure there is). 499. Date: Sun, 1 Oct 89 17:23:26 PDT From: shirriff (Ken Shirriff) Subject: kgdb.sun4 is strange The editing controls no longer work correctly for kgdb.sun4. Backspace now does some strange nondestructive cursor motion function instead of performing the normal backspace function. 500. Date: Sun, 1 Oct 89 22:16:11 PDT From: douglis (Fred Douglis) Subject: rpcecho/rpccmd -ping rpcecho -h pride -d 16384 -n 1000 Rpc Send Test: N = 1000, Host = pride (6), size = 16384 N = 1000, Size = 16384, Time = 0.039671 rpccmd -ping pride -b 16384 Send 16384 bytes 0.020078 sec I assume the echo is bouncing the entire packet back again, huh? but the one-way ping doesn't have the same flexibility for repeating the test a variable number of times, etc. since these two programs do different things even though they look so similar, perhaps the documentation should be clearer? ("rpc send test => rpc bounce test" or something?) 501. Date: Sun, 1 Oct 89 23:39:48 PDT From: douglis (Fred Douglis) Subject: tx/pdev bug I held down ^A a bit to repeat the same command multiple times. tx died with the following: ReplyWithData couldn't send pdev reply; status "address given by the user for a system call was bad" 502. Date: Mon, 2 Oct 89 02:22:23 PDT From: douglis (Fred Douglis) Subject: pmake/migration bug w.r.t. 
high parallelism when pmake goes past about 10 parallel tasks, it seems to hang fairly reliably. no idea why yet. could be machine flakiness (i ran up to 10 based on an rlogin to hijack, then needed to use hijack too so ran the pmakes from kvetching, and that's when they started hanging. rebooting didn't help. still, 10 seems like a funny magic number...) 503. Date: Mon, 2 Oct 89 03:01:23 PDT From: douglis (Fred Douglis) Subject: new X too unstable I reported a bug the other day when xkill caused my Xcfb.new server to die, right? well, "xhost" generated an error when given a hostname, and caused the server to die when invoked with no arguments. 504. Date: Mon, 2 Oct 89 09:13:43 PDT From: brent (Brent Welch) Subject: Re: Weird /mic behavior The chdir() by csh does an open which goes through the regular recovery stuff in the prefix table routines. It appears, however, that the open wasn't correctly aborted when the recovery failed due to "domain unavailable". There is probably some bug associated with the failure to reestablish a prefix table entry. By the time the ls was done, then the prefix handle was already marked invalid, so the prefix was cleared and another broadcast was made. So, the difference between your two cases was not due to a difference between 'cd' and 'ls', but between the first use of the /mic domain and subsequent ones. The first case seems repeatable, and perhaps I'll have time to test it on assault or something. 505. Date: Mon, 2 Oct 89 09:19:59 PDT From: brent (Brent Welch) Subject: Re: rpcecho/rpccmd -ping rpcecho -s does a 'send' instead of an 'echo': Usage of command "rpcecho" -n: Number of RPCs to do Default value: 100 -d: Datasize to transmit Default value: 32 -D: Do tests at all sizes -e: Echo off RPC server (default) -r: Number of reps for each size Default value: 10 -s: Send instead of Echo -t: Trace records taken (runs slower) -c: High priority -h: name of target host -help: Print this message 506. 
Date: Mon, 2 Oct 89 09:22:52 PDT From: brent (Brent Welch) Subject: Re: tx/pdev bug ReplyWithData couldn't send pdev reply; status "address given by the user for a system call was bad" This is a known problem. If the user's buffer is bad tx gets an error and aborts. The pdev code needs to be fixed to determine which buffer (user's or server's) is bad. 507. Date: Mon, 2 Oct 89 13:00:06 PDT From: mgbaker (Mary Gray Baker) Subject: lots of icky sparc station stuff I knew it would be a useful exercise to try living on a sparc station... Lots of stuff seems to have gone haywire since the last time I tried a lot of this. And some of these are continuing bugs. 1) The machine gets in a mode sometimes from a particular csh window where everything exec'd from the csh gets a seg fault. This is horrible since it probably means something about caches or register windows not being flushed at the right time. Brent noticed this happening once on a regular sun4 if I'm not mistaken, so this isn't just a sparc station problem. This did not happen before, so something has changed to create this mess. 2) Vi keeps forgetting its TERMCAP and using open mode. I reported this bug before. 3) Some X applications, such as xclock, keep dying in XtConvert(). 4) It's sometimes hard to debug user programs with seg faults, since the debugger often seg faults on them. When I can debug them though, it appears there was no reason for them to seg fault where they did. This again points to a cache or register window flushing problem that isn't updating the stack at the right time... 508. Date: Mon, 2 Oct 89 13:33:00 PDT From: eklee (Edward K. Lee) Subject: missing directory One of my directories /sprite/users/eklee/cmds.md seems to have mysteriously vanished. It was there Friday but not today. I'm not sure when it was last modified (probably a long time ago. 509. Date: Tue, 3 Oct 89 13:54:35 PDT From: jhh@sprite.Berkeley.EDU (John H. 
Hartman) Subject: unknown problem with thyme

Thyme got very sluggish on me and a ps -au revealed a process in the UNUSD state using 47.7% of the cpu. I put thyme into the debugger but was unable to attach to it from allspice. It also ignores kmsg -c requests. Thyme was running kernel 1.023. I don't think there were any migrations in progress. File this one away for future reference.

510.
Date: Tue, 3 Oct 89 15:28:12 PDT
From: pmchen (Peter M. Chen)
Subject: corrupted file

My mailbox got corrupted sometime (don't know when): Any ideas of what happened? I left a copy of the file in ~pmchen/tmp/corruptedmail

511.
Date: Tue, 03 Oct 89 16:28:18 PDT
From: rab (Robert A. Bruce)
Subject: piquante

Piquante is in the debugger:

    Fatal Error: Software time is ahead of the hardware

512.
Date: Tue, 3 Oct 89 16:37:22 PDT
From: brent (Brent Welch)
Subject: Allspice cache crash

Allspice died in the block cache. It apparently found a block associated with a previous incarnation of a domain. John H. had unmounted a file system and remounted it under a different name. I believe that the unmount left a block in a funny state in the cache. It was an indirect block, or perhaps a block of file descriptors - it thought it was associated with the "physHandle" of the domain, which is used for indirect blocks and file descriptors. However, while the block referenced the physHandle, the physHandle didn't reference the block. A panic occurred when DeleteBlock tried to take this block away from the physHandle.

More details: the block was in the LRU list, and it was found by FetchBlock. FetchBlock called DeleteBlock in order to take the block away from its current owner. DeleteBlock found the block in the hash table, but it died trying to remove it from the per-file block list (or indirect block list).

This is code I have stared at in the past. There is no obvious place where things could easily get out of whack, but it is all rather complex and not obviously correct either.
I did glance at the Unmount code, and there doesn't seem to be any particular attention paid to the cache. A write-back is done, but there are no consistency checks made on the physHandle associated with the domain. Checks should be added - the unmount code is probably the least used code we have.

513.
Date: Tue, 3 Oct 89 20:17:11 PDT
From: pmchen (Peter M. Chen)
Subject: transient bug in floating point?

About 15 minutes ago I compiled a program which had always run fine and got an odd error from a print statement

    printf("tot1=%d, tot=%d, i=%d\n",tot1,tot,i);
    printf("%.2lf %% requests fulfilled in %d ms\n", (double)tot1*100.0/tot,i);
    printf("%d %lf %d\n",i,(double)tot1*100.0/tot,i);

produced something like:

    tot1=300, tot=301, i=40
    99.67 % requests fulfilled in 120385833 ms
    40 99.6666667 120385833

I'm making this up because I don't have the real output when the program was doing this (so the 120385833 is fudged). But it did give garbage there instead of "40". It looks like the result of the floating point computation is wrecking the next argument to printf. I've recompiled it many times and it did this consistently (on a sun3). Then I moved to a sun4 and it worked fine. After this, I moved the routine to a separate module and recompiled (on the sun3) and it works fine now. I am not compiling with hardware support. The program is ~pmchen/raid/mult and the offending routine is printlat (in printlat.c).

514.
Date: Wed, 04 Oct 89 09:54:34 PDT
From: rab (Robert A. Bruce)
Subject: pepper

Pepper is in the debugger:

    Fatal Error: Trying to broadcast non-prefix

515.
Date: Wed, 4 Oct 89 12:29:19 PDT
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: uwm dies

My uwm dies using Xmfb.new. It doesn't go into the debugger, it just goes away. Any new ones I start just sit there and do nothing.

516.
Date: Wed, 4 Oct 89 16:41:20 PDT
From: brent (Brent Welch)
Subject: File server lock-out

You can fully occupy the attention of a Sprite file server by writing a huge file.
The new SCSI interface happily queues up a zillion blocks, and then the SCSI interrupt handler chains through the blocks writing each one. In the meantime the server doesn't do much else. I noticed this the other day when pounding on assault, and it happened again today when John H tried to write a huge file to test out a new disk. My innocent editor write-back hung until his job was aborted. You can also experience this by trying to use Oregano as a workstation. I haven't fully diagnosed the problem with the debugger or anything, but I think that between the disk interrupts and the block cleaner things are effectively blocked out of the file system cache. I'm not sure exactly, but perhaps my write couldn't complete because the server couldn't read an indirect block until the file currently being written out cleared the disk queue. Adding interrupt priorities would only help mouse response when the disk is busy, and perhaps this isn't that important. I'm not sure what to do about the disk queue. Perhaps we can throttle the block cleaner so it only does N blocks of a file at a time (the cleaning is done on a per-file basis) so that other cache I/O's can slip in. This is much like the old problem we had where the disk queuing wasn't fair at all, and once the block cleaner got a hold of it it didn't let go until it was done. Now the block cleaner is free to queue up the whole cache! 517. Date: Wed, 4 Oct 89 17:50:50 PDT From: tve (Thorsten von Eicken) Subject: ds3100 ld spits out "LINK EDITOR MAP" on "ld -r" Yeah, I have a "bigcmd" directory. I type mkmf and pmake and at the end when it comes to the link, it does it and then spits the LINK EDITOR MAP at me. Is this a feature? 518. Date: Wed, 4 Oct 89 23:53:09 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: assault runs out of memory Assault runs out of memory if you get too many file handles. 519. 
Date: Thu, 5 Oct 89 13:19:51 PDT
From: brent (Brent Welch)
Subject: signal/proc deadlock

Garth found Basil in a deadlock today. I hunted around for a while and deduced that there was a deadlock between the Sig:sigLock and the Proc:tableBlock. I didn't fully figure the deadlock out, as I simply stopped after spending half an hour or so looking around. Basil had many processes in the debug state, by the way. There were also a couple of processes trying to send signals, including an Rpc_Server from some remote host. Finally, the Xsprite process was locked, but I couldn't quite figure out who had it locked.

With the 'holderPC' and 'holderPCBPtr' we ought to have enough information to figure these deadlocks out. (In fact, having this really helps a lot.) However, it is still tedious although slightly less time consuming. Is it hopeless to hope for improved debugger support? I am fearful that the difficult bugs in Sprite will not be solvable in our current environment, especially as the experts/implementors begin to leave.

This is a strong plea for better attention to the debugging facilities. For example, it is still probabilistic whether you can examine a local variable in gdb. Sometimes you just get "Error: invalid address 0". It is also painful to examine 30+ processes to determine what the deadlock is. Or, for another example, if a machine hangs while trying to enter the debugger (i.e. the cache-lock is held so you can't sync the disks) then you have to manually scan through all the processes and see which one got the panic. It is little things like this that conspire against good debugging. It's too bad that none of us want to work to improve the debugging environment (hint hint). I think there is lots of room for improvement. Flame off.

520.
Date: Fri, 6 Oct 89 08:39:34 PDT
From: ouster (John Ousterhout)
Subject: Crash and disk space

When I came in today Allspice was catatonic: it didn't respond to its keyboard at all and wasn't responding to rpc requests.
I gave up and rebooted it. Also, there was no disk space left on /sprite/src/kernel. In order to unwedge Mace (which was apparently hung trying to write back something from a migrated process), I deleted the sun3.1.023 kernel (it didn't appear to me to be in use any more).

521.
Date: Fri, 6 Oct 89 12:06:28 PDT
From: mgbaker (Mary Gray Baker)
Subject: tx window in the debugger

My tx window with a long-standing kernel debugging session in it just went into the debugger. I don't think I did anything weird except that I typed a return key in it for the first time after a number of hours.

522.
Date: Fri, 6 Oct 89 16:24:30 PDT
From: brent (Brent Welch)
Subject: Mint crash Friday

As you probably know, mint had a rough afternoon on Friday. The underlying cause is that the bug I attempted to fix concerning scavenging a handle for a file that is being deleted was not fixed, apparently. Mint was deleting a file in /tmp and got a bus error because a handle didn't have a file descriptor attached to it (a sign of scavenging). Interestingly, fscheck didn't complain (this time) about the file that was in the process of being deleted.

Mint then had troubles during recovery. After the very first round of re-opens it simply hung - lots of processes in the ready state, and an lpd process in the running state. I rebooted, and this time fscheck found that the tmp file which caused the first crash referenced a non-allocated file descriptor.

Anyway, towards the very end of recovery #2 mint crashed again, this time with a different bug related to local file handles, another one I had thought I'd fixed. This bug concerns what happens when the handle table fills up - there is a window of time where a handle is partially installed, and apparently the wrong guy got it back. (That's a hand-wavy explanation. The problem is probably in Fsutil_HandleInstall.)

Now for the fun part. The next reboot sequence failed with the following message:

    Unknown user brent (!!)
It turns out that /etc/passwd got truncated (yow!), I was the owner of /sprite/cmds/csh, and csh couldn't execute the /boot/bootcmds script because of no /etc/passwd. Luckily we could access the other servers from the single user shell, and we copied /t1/etc/passwd to /etc/passwd, sourced the boot script, and we seemed to be back in business. The third time is the charm, as they say, and mint was able to make it through recovery ok.

I'll go look at my brain-damaged code that concerns local file handles, as mint crashed in two different ways in this area.

523.
Date: Sun, 8 Oct 89 10:32:49 PDT
From: ouster (John Ousterhout)
Subject: Mint crash

When I came in this morning Mint was not responding to RPC requests. I went up to the machine room and discovered that Allspice was out of disk space on /user1, and Mint had used up all its console paper printing out disk full messages for files it was trying to write to /user1. This apparently had hung Mint? I added more paper to the console, at which point Mint printed a bunch of unintelligible garbage on the console and then went catatonic (no response whatsoever to the console). At this point I rebooted Mint. Unfortunately, many of the clients did not recover ("Recovery failed <30002>"). I then rebooted Mint a second time, but many clients still didn't recover. Fortunately, piracy was one of the lucky ones. I then used piracy to free up disk space on /user1, and when I did that Mace then recovered. I don't know whether the lack of disk space somehow impacted recovery or this was just a coincidence.

524.
Date: Sun, 8 Oct 89 13:39:27 PDT
From: ouster (John Ousterhout)
Subject: Kgdb and registers

It doesn't appear to be possible to set register values from Kgdb. When Mendel and I tried this today we ended up with the value "4" in the register, which wasn't at all what we thought we were storing.

525.
Date: Sun, 8 Oct 89 13:41:19 PDT From: ouster (John Ousterhout) Subject: Sun-4, interrupts, and debugging If a Sun-4 is forced into the debugger with "kmsg -d", and is then debugged with kgdb, kgdb does not correctly identify the stack frame that was active when the network interrupt occurred. This makes it very hard to locate an infinite loop in the kernel, for example. Mary, can you fix the interrupt code to fudge enough information on the stack so that Kgdb can correctly identify the frame that was interrupted? 526. Date: Sun, 8 Oct 89 20:10:11 PDT From: pmchen (Peter M. Chen) Subject: pmake I get the following error message from pmake clean --- tidy --- rm -f %(sh: syntax error at line 1: `(' unexpected *** Error code 2 pmake: 1 error I had just 'pm mkmf'-ed this directory. The offending directory is ~pmchen/simul, and this error occurred on anise and on mustard (with TM=sun4). 527. Date: Sun, 8 Oct 89 20:15:26 PDT From: pmchen (Peter M. Chen) Subject: floating point error? I have another program with really weird errors. Floating point variables get changed by miscellaneous program statements (such as a printf statement). This happens on the sun3's (mustard), compiled with hardware floating point. It doesn't happen on the ds3100's. I don't know whether it happens on the sun4's or not (see previous message to bugs about sun4 pmake problems). The problem does NOT happen using software floating point on the sun3's. The program is ~pmchen/simul/simul. You can produce the error with simul -d 1 -q 1 -i 2 -r 0 Watch for the NaN outputs. 528. Date: Fri, 6 Oct 89 10:09:12 PDT From: pmchen (Peter M. Chen) Subject: problem in allspice I am using the Sprite FS in, shall we say, out of the ordinary ways: ie. writing thousands of files to one directory. I was running simulations on parsley which output lots of small files to ~pmchen/simul/out/small. The csh script I ran is in ~pmchen/simul/ex/small. 
This ran fine (to completion) last night on parsley, but might be the cause of the problems this morning. As instructed by John O., I F1-A'ed parsley so we could see if allspice stays up for a while. Of course, Randy's machine is thus unavailable.

529.
Date: Mon, 09 Oct 89 06:34:21 PDT
From: rab (Robert A. Bruce)
Subject: allspice

When I came in this morning allspice was frozen. It didn't respond to the keyboard or to the network. There were no error messages on the screen. /user1 was being dumped when it died.

530.
Date: Mon, 9 Oct 89 12:28:00 PDT
From: mgbaker (Mary Gray Baker)
Subject: Re: Sun-4, interrupts, and debugging

[I sent this yesterday, but it seems that at least neither Fred nor Mendel got it. I think something went wrong with fenugreek's sendmail or whatever.]

It sounds to me like people don't have quite the picture of how the register windows and stack frames work on the sun4. The problem is not in the kernel. We can easily fix the problem, and will do so, but it shouldn't mean changing what's in a trap frame, and there's really no such thing as "fudging enough information" since an interrupt frame is just a trap frame on the sun4 (because interrupts are just asynchronous traps on the sun4). I think everybody agreed this was a nice clean way of doing it and changing this right now would involve reworking a lot of stuff.

Here's what the debugger is getting confused about: as it traces back along the stack, looking at each frame as if it's a C call frame, it looks for the pc of the calling routine in %i7. This is %o7 of the previous register window. If a trap occurs, the register window gets bumped forward one (by the hardware) and various values are stuffed into registers in the new register window (by the hardware). It's this trap frame that the debugger sees. The problem is that the pc of where the trap occurred gets put into %l1 (by the hardware) instead of into %i7. This confuses the debugger since it doesn't special-case the trap frame.
But I can't stuff the pc into %i7, since that's part of the state we can't overwrite. So, in %i7, the debugger usually finds the pc of the routine that last made a procedure call from that window. What we can do instead, is have the debugger recognize the range of pc's for the trap (and interrupt) handlers, and if it finds such a pc in a %i7, it can special-case what to do with the stack frame before that, since it will be a trap frame and not a C call frame. 531. Date: Tue, 10 Oct 89 23:01:11 PDT From: shirriff (Ken Shirriff) Subject: Bug in Proc_AddMigDependency? Proc_AddMigDependency (procMigrate.c), line 182, calls HashFind(table, (Address) processID), which calls Hash, which uses the second argument as a pointer to the string to hash. Since the processID doesn't point to a valid string, this crashes. This happened when I tried to do a pmake running a new kernel of mine. The stack trace is MachSysCall->MachUserReturn->Sig_Handle->Proc_MigrateTrap ->Proc_AddMigDependency->Hash_Find->hash.Hash. As far as I can tell this bug has always been in there, but I don't know why things have worked up until now. Maybe my kernel is confusing something? 532. Date: Tue, 10 Oct 89 23:56:22 PDT From: Fred Douglis <douglis> Subject: ds3100 flakiness returns things are acting weird again. for example, a couple of times today i had cc's returning exit statuses of 1 with no warnings, where a recompile went fine; i had one set of cc's complain about typedefs not existing when they were fine (again, recompiling worked fine); and finally i spent a half hour trying to boot a new kernel, hitting "Enabling timer interrupts" early in the boot sequence and then dying. I tried different combinations of reset+init+bootpath+etc without help. finally i relinked my kernel and it worked just fine. 533. 
Date: Wed, 11 Oct 89 14:14:30 PDT From: Fred Douglis <douglis> Subject: repeated recovery when mint froze up before, i got a bunch of "cacheable/busy" conflict messages and then recovery over and over. Finally, once things started to clear up, I was down to a tight loop of recovery followed by a stale handle on a file that was accessed by a process that went into the debugger as soon as mint started responding again. I'll send brent my syslog with a copy to the sprite-log -- no need to burden everyone else with it, since it's very long. 534. Date: Wed, 11 Oct 89 20:57:50 PDT From: Fred Douglis <douglis> Subject: /dev/syslog truncation bug I was able to test out my syslog change on sun4s, and while trying to exercise the bug I ran into something else. It seemed that if I suspended something reading /dev/syslog, and I wrote lots of stuff to syslog in one operation, I could overflow the syslog and cause an old kernel to go into an infinite loop as expected. But, both old and new kernels had another problem: if I said "cat xyz > /dev/syslog" repeatedly, each one would overwrite the previous one rather than filling the buffer and overflowing! After lots of head scratching I found out that the ioctl interface for syslog clears the buffer, and csh opens /dev/syslog with truncation set. This means that it would be possible to lose stuff from the syslog if it got truncated before the reader got in to get the data. I'm going to remove support for IOC_TRUNCATE; speak up if you can think of a case to reinstate it. 535. Date: Wed, 11 Oct 89 21:39:27 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: evil black blob lives!! I've got one of those nasty black blobs that extends from my cursor to the right edge of my tx window on hijack. I was under the impression this was fixed, but evidently the blob knows differently. It is now immune to 'clear'. 536. 
Date: Thu, 12 Oct 89 10:29:02 PDT From: Fred Douglis <douglis> Subject: large selection doesn't work If I select a large region, and then use "select" to write it to a file, nothing gets produced. I'm pretty sure this worked as of a few weeks ago. If I select several lines at a time, things work okay. 537. Date: Thu, 12 Oct 89 11:27:33 PDT From: tve (Thorsten von Eicken) Subject: something wrong with mail: /sprite/spool/mqueue not found On the sun4's (burble, allspice) shortly after I send mail, I get an error message on my tty saying: queuename: Cannot create "qf~Z210967" in "/sprite/spool/mqueue": no such file or directory This does not happen on ds3100, (nor sun3s I think). 538. Date: Wed, 11 Oct 89 16:32:06 PDT From: mgbaker (Mary Gray Baker) Subject: ranlib dies on sun4 Ranlib gets a segfault on the sun4 in the routine stash() at line 309 when it dereferences s->n_un.n_name. The address is out of bounds (0xfe15280c). 539. Date: Thu, 12 Oct 89 13:01:40 PDT From: ouster (John Ousterhout) Subject: Mail file trashed The last few bytes of my mail file got lost today. The result was a partial header from Mary, followed by a header and message from a 60B student. By the time I noticed it, the mail file had already been modified a couple of times, so I didn't bother to save the damaged copy. Mary, if the message you sent just after the one about "tx search dies on a sun4" is important for me to see, could you resend it? 540. Date: Thu, 12 Oct 89 13:35:16 PDT From: mendel (Mendel Rosenblum) Subject: wall kills rlogin Brent's last wall message terminated a rlogin from murder to anise. The message: anise% df . Prefix Server KBytes Used Avail % Used /mnt anise 284000 3148 252452 1% anise% Broadcast message from brent@oregano.Berkeley.EDU at 13:16 ... Sayonara - rebooting after 20 days of uptime to test recovery and the new kernel 541. 
Date: Thu, 12 Oct 89 13:36:12 PDT
From: mendel (Mendel Rosenblum)
Subject: wall kills rlogin

Brent's last wall message terminated an rlogin from murder to anise. The message:

    anise% df .
    Prefix  Server  KBytes  Used  Avail   % Used
    /mnt    anise   284000  3148  252452  1%
    anise%
    Broadcast message from brent@oregano.Berkeley.EDU at 13:16 ...
    Sayonara - rebooting after 20 days of uptime to test recovery and the new kernel
    PdevServiceRequest: bad request on request stream: 540095032
    Connection closed.
    murder%

542.
Date: Thu, 12 Oct 89 17:53:21 PDT
From: brent (Brent Welch)
Subject: FS deadlock found

I think I have figured out the deadlock that has killed mint the past few times. It occurs during times of heavy load because a client responds to a call-back too fast, and locks are acquired (released, actually) in the wrong order. I need to take off for dinner, but it would be nice if I could have some time to truly verify this deadlock (by scouring the code some more) and figure out a correct fix for the new .new kernels. If Mary wants to use things as is and reboot Allspice with a better sun4 kernel (perhaps sun4.mgbaker) that would be ok. Currently mint and oregano are running sun3.brent (BW.151) which has my other RPC/RECOV/FS fixes in.

543.
Date: Thu, 12 Oct 89 18:01:48 PDT
From: mendel (Mendel Rosenblum)
Subject: slow source listing in gdb.new

The reason that the new gdb lists source lines so slowly on Sprite is that it calls the library routine isatty() for each character displayed. On unix the isatty() routine takes around 100-200 microseconds while it takes 2-4 milliseconds on Sprite. The reason is that Sprite forwards the ioctl to the terminal driver using pdevs.

544.
Date: Thu, 12 Oct 89 18:37:45 PDT
From: mendel (Mendel Rosenblum)
Subject: cc1.68k dies

cc1.68k dies on the following code fragment from the net module.

    NetIERecvUnitInit()
    {
        volatile struct {
            char recvUnitStatus:7 ;
        } *scbPtr;
        scbPtr->recvUnitStatus;
    }

545.
Date: Thu, 12 Oct 89 18:44:44 PDT
From: mgbaker (Mary Gray Baker)
Subject: ipServer and deadlock

The ipServer on covet died. When I killed the inetd and ipServer in preparation to restart the ipServer, covet went into the debugger with deadlock on schedMutex. I wrote down the pc, etc, in case anyone is interested.

546.
Date: Fri, 13 Oct 89 11:17:27 PDT
From: brent (Brent Welch)
Subject: Re: vmPageTableInc bug was List problem

I added a list and wasn't using the List macros right, which resulted in me trashing vmPageTableInc. I seem to do this every time I add a new list, because if you aren't careful you end up using the list header as a list element. The List_ macros are happy to return you the list header, which is dangerous. If you don't use LIST_FORALL, you have to use the following code sequences to get the first element, then the next:

    /* Get the first element of the list, or NIL if the list is empty */
    if (List_IsEmpty(recovPingList)) {
        pingPtr = (RecovPing *)NIL;
    } else {
        pingPtr = (RecovPing *)List_First(recovPingList);
    }

    /* Get the next element of the list, or NIL if at the end of the list */
    pingPtr = (RecovPing *)List_Next((List_Links *)pingPtr);
    if (List_IsAtEnd(recovPingList, (List_Links *)pingPtr)) {
        pingPtr = (RecovPing *)NIL;
    }

    brent

ps. You can't use LIST_FORALL if the list can change dynamically. In this case I have a list that can grow, so I use a monitor to control list iteration and addition of items to the list. Anyway, I ended up using the list header as a list element....

547.
Date: Fri, 13 Oct 89 11:45:54 PDT
From: ouster (John Ousterhout)
Subject: Second gateway

I sent mail to Herve DaCosta asking about getting a second gateway out of the SPUR net to replace ji. There's already a machine in the works for this, called "csgw2". It should be on-line in the not-too-distant future.

On a related note, Brian Shiratsuki asked if Sprite is capable of switching name servers if the first choice doesn't respond.
I don't know if we do this, but if it isn't hard to implement it seems like a good idea. Thus if csgw is down we could switch to ginger or csgw2.

548.
Date: Fri, 13 Oct 89 12:03:22 PDT
From: Fred Douglis <douglis>
Subject: profiling broken

user-level profiling (on sun3s) is not recording run-time PC sampling. I can get a call graph but not how much time is spent in each routine. (I've talked to Bob about this, but I wanted to file an official bug report too.)

549.
Date: Fri, 13 Oct 89 12:21:55 PDT
From: tve (Thorsten von Eicken)
Subject: lost mail to bugs

'cause of mail problem (the /sprite/spool/mqueue not found on sun4's stuff...) I'll remail everything, pardon if something arrives twice.

550.
Date: Fri, 13 Oct 89 12:23:55 PDT
From: tve (Thorsten von Eicken)
Subject: The mail problem on sun4's

(It always says something like:

    queuename: Cannot create "qf~Z275756" in "/sprite/spool/mqueue": no such file or directory

) I guess it has to do with /sprite/spool/mqueue being owned by root, group wheel and NOT world-writable.

551.
Date: Fri, 13 Oct 89 12:26:27 PDT
From: tve (Thorsten von Eicken)
Subject: group sprite

I know it's a pain to keep track of what group files belong to, but: if someday the world gets reorganized (with the new disks), could the person(s) doing that take care of which group the files/dirs get into? Those who don't have a "su" window on their screen will thank you! (hehe..)

552.
Date: Fri, 13 Oct 89 13:11:07 PDT
From: Fred Douglis <douglis>
Subject: _extendsfdf2 missing

I tried to link a new copy of something using libc_p. it found everything but _extendsfdf2. I looked for this in libc and saw that there was an object file in gnulib/sun3.md/oldobjs but nothing in sun3.md itself. _extendsfdf2.po is a link to a nonexistent _extendsfdf2.o in sun3.md. i suspect if we were to remove libc.a at this point and remake the library from scratch (as is done every so often), a lot of programs might not link anymore.
i just checked for other missing links, and _builtin_new, _lshrsi3, _subsf3, and _varargs all suffer from the same problem. anyone know what happened here? 553. Date: Fri, 13 Oct 89 13:22:11 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: fs bug Oregano just printed the following to its syslog: BlockIOProc: firstSector(1862854) > lastSector (630107) BlockIOProc: firstSector(1862854) > lastSector (630107) ... BlockIOProc: firstSector(7803064) > lastSector (630107) BlockIOProc: firstSector(4644646) > lastSector (630107) Somebody thought the disk was bigger than it actually was. It looks like BlockIOProc returns SUCCESS in this case. Why doesn't it panic, or at least return failure? 554. Date: Fri, 13 Oct 89 13:36:51 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: rlogin trashed /sprite/cmds.sun3/rlogin was overwritten with garbage at about 1:00 pm. I noticed at about 12:59, at which point the descriptor had been modified at 12:58:20. The last descriptor modified time was 13:05:02. I've moved the file to /sprite/trashed. I can't make any sense of its current contents so I have no idea who did it. 555. Date: Fri, 13 Oct 89 13:41:32 PDT From: Fred Douglis <douglis> Subject: Re: rlogin trashed the first string in the trashed file is a line from the loadavg daemon. looks like recovery got confused. in fact, i'll bet i know why: fenugreek was in the debugger, and i wanted to use it, and i had no idea why brent (?) threw it into the debugger around 8am today so i figured i'd continue it and see what happened. that's about the time the problem arose, now that i think of it. also, rlogin was continually being updated. the string occurs at offset 0, which is odd. i would expect it to be offset (8*187), which would be host 8's entry in the database file, or at offset 0 in /hosts/fenugreek/migInfo, which is in a different domain. 556. 
Date: Fri, 13 Oct 89 13:48:36 PDT From: brent (Brent Welch) Subject: Re: fs bug firstSector > lastSector BlockIOProc: firstSector(4644646) > lastSector (630107) Somebody thought the disk was bigger than it actually was. It looks like BlockIOProc returns SUCCESS in this case. Why doesn't it panic, or at least return failure? The server shouldn't panic, of course. What it does is return SUCCESS and zero bytes transferred, because this emulates what happens when you try to read past end-of-file. 557. Date: Fri, 13 Oct 89 13:49:15 PDT From: rab (Robert A. Bruce) Subject: dump The tape drive isn't working. When I try to access it I get /hosts/murder/dev/exabyte.norewind: connection timed out and this message appears on murder's console: Warning: SCSI3 can't select SCSI3#0 Target 5 LUN 0 I checked all the cables and everything seems to be okay. I tried power cycling the tape drive, and tried a couple different tapes. Then I tried booting an old kernel, but that didn't help either. Since the tape didn't work, I put this morning's dump into /t6/dump.lev1.13Oct. 558. Date: Fri, 13 Oct 89 13:55:29 PDT From: pmchen (Peter M. Chen) Subject: decstation cc error I was in ~pmchen/verses/verse, and issued pm on forgery. Here's what happened: forgery% pm --- ds3100.md/verse.o --- rm -f ds3100.md/verse.o cc -g3 -O -Dds3100 -Dsprite -Uultrix -I/users/pmchen/lib/include -I. -Ids3100.md -I/sprite/lib/include -I/sprite/lib/include/ds3100.md -c verse.c -o ds3100.md/verse.o ccom: Warning: verse.c, line 140: statement not reached endwin(); ------------^ (ccom): verse.c, line 141: ccom: Internal: schain botch } ^ *** Error code 1 pmake: 1 error The same compile worked fine on nutmeg. Any ideas? Do we have the dec compiler? 559. Date: Fri, 13 Oct 89 14:11:58 PDT From: Fred Douglis <douglis> Subject: sendmail this is because thorsten was using an invalid "option" (Mail foo -c bar) that confused sendmail. sendmail works fine normally even if a user is unknown. 
there is a bug when sending to recipient "-c" but this isn't related to sprite. 560. Date: Fri, 13 Oct 89 14:44:11 PDT From: tve (Thorsten von Eicken) Subject: flaky size on /bin/ls -ls can someone explain the following (happens on ds3100 & sun4c, dunno sun3) [gluttony tve] /bin/ls -ls worm-pipe 76 -rw-rw-r-- 1 tve 72175 Oct 13 14:36 worm-pipe [gluttony tve] cp worm-pipe foo [gluttony tve] ls -ls worm-pipe foo 71 -rw-rw-r-- 1 tve 72175 Oct 13 14:42 foo 76 -rw-rw-r-- 1 tve 72175 Oct 13 14:36 worm-pipe [gluttony tve] diff foo worm-pipe [gluttony tve] 561. Date: Fri, 13 Oct 89 14:45:26 PDT From: tve (Thorsten von Eicken) Subject: more flaky /bin/ls -ls sorry, forgot to mention that an /bin/ls -ls after the diff yields: [gluttony tve] ls -ls worm-pipe foo 76 -rw-rw-r-- 1 tve mic 72175 Oct 13 14:42 foo 76 -rw-rw-r-- 1 tve mic 72175 Oct 13 14:36 worm-pipe 562. Date: Fri, 13 Oct 89 14:49:11 PDT From: tve (Thorsten von Eicken) Subject: uncompress didn't work on sun4's (fixed) compress did. I recompiled and reinstalled /a/attcmds/compress for sun4s. 563. Date: Fri, 13 Oct 89 15:22:22 PDT From: brent (Brent Welch) Subject: Re: more flaky /bin/ls -ls You are experiencing the delayed-write caching of Sprite. The indirect blocks are not allocated to the file until it is written to disk, so they don't show up in the block count until sometime after the file is created. If write-back caching worries you, remember that all Sprite editors use fsync(), which really and truely forces files to disk. 564. Date: Tue, 10 Oct 89 01:56:12 PDT From: tve (Thorsten von Eicken) Subject: ds3100 cc seems to define "ultrix" I know why this is so... I just wanted to point this out in case someone ports software which uses #defines ... The search for ..../include/sys/limits.h was dependent on ultrix being defined, so maybe one can ignore my previous message!? 565. 
Date: Tue, 10 Oct 89 08:48:24 PDT
From: brent (Brent Welch)
Subject: RPC error

Thyme crashed while handling an open() because it got an errant RPC reply from the server. I've seen this before. The RPC trace shows the problem:

c3cc0 out 0.0000 Q  32 14 26 6 get attr  16  0 0 0 0 500
c3cc0 in  0.0000 R  32 14 26 6 get attr 112  0 0 0 0 500
c3cc1 out 0.0100 Q  32 14 26 6 open      92 15 0 0 0 500
c3cc1 in  0.0100 R  32 14 26 6 open     112  0 0 0 0 500
c3cc2 out 0.0000 Q  32 14 26 6 get attr  16  0 0 0 0 500
c3cc2 out 0.1000 Qp 32 14 26 6 get attr  16  0 0 0 0 500
c3cc2 in  0.0000 R  32 14 26 6 get attr 112  0 0 0 0 500
c3cc3 out 0.0100 Q  32 14 26 6 open      92 15 0 0 0 500
c3cc3 in  0.0000 R  32 14 26 6 get attr 112  0 0 0 0 500
c3cc3 in  0.0000 R  32 14 26 6 open     112  0 0 0 0 500

See how RPC c3cc3 gets a "get attr" reply instead of an "open" reply. Apparently thyme resent its "get attr" request at about the same time that mint replied. Then, after it issued its open request it picked up the retransmitted RPC "get attr" reply instead of the open reply. My hunch is that perhaps the "get attr" reply was sitting in thyme's input buffer already, at the time the open request was issued, and the client dispatcher is erroneously picking it up.

566.
Date: Tue, 10 Oct 89 08:56:20 PDT
From: douglis (Fred Douglis)
Subject: loadavg recovery problem

After the file servers rebooted, i noticed that "finger" didn't list many people. turns out several hosts were listed as down. this was still true after about a half hour. logging into them must have triggered recovery, however, since within a minute of logging into the two i tried out, they were listed as up again.

567.
Date: Tue, 10 Oct 89 10:10:55 PDT
From: Fred Douglis <douglis>
Subject: repeating console write bug found ... I hope.

Turns out that when the buffer overflowed in Dev_SyslogWrite, it wouldn't subtract the amount written directly to the console, so it would return that 0 bytes were written and Fs_Write would try again.
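The accounting error described above can be sketched in C. This is only an illustration under stated assumptions, not the Sprite source: SyslogWriteSketch, WriteAllSketch, ConsoleWrite, SYSLOG_BUF_SIZE and the "fixed" flag are hypothetical names, and the real Dev_SyslogWrite may be structured differently. The sketch shows how an overflow path that sends data to the console without subtracting it from the remaining count reports 0 bytes written, so a retrying caller repeats the same console output.

```c
#include <assert.h>
#include <string.h>

#define SYSLOG_BUF_SIZE 16          /* hypothetical; tiny so overflow is easy to hit */

static char syslogBuf[SYSLOG_BUF_SIZE];
static int  syslogUsed;             /* bytes currently queued in syslogBuf */
static int  consoleBytes;           /* bytes sent straight to the console */

/* Hypothetical stand-in for the direct console path taken on overflow. */
static void ConsoleWrite(const char *data, int len)
{
    (void)data;
    consoleBytes += len;
}

/*
 * Returns the number of bytes consumed. With fixed == 0 the overflow path
 * forgets to subtract what went to the console, so it reports 0.
 */
static int SyslogWriteSketch(const char *data, int len, int fixed)
{
    int remaining = len;
    int room = SYSLOG_BUF_SIZE - syslogUsed;

    assert(len >= 0);
    if (remaining > room) {
        ConsoleWrite(data, remaining);
        if (fixed) {
            remaining = 0;          /* the subtraction the buggy path skipped */
        }
    } else {
        memcpy(syslogBuf + syslogUsed, data, (size_t)remaining);
        syslogUsed += remaining;
        remaining = 0;
    }
    return len - remaining;
}

/*
 * Hypothetical stand-in for a caller that retries until everything is
 * written. maxTries bounds the loop so the buggy case terminates here.
 * Returns the number of calls made.
 */
static int WriteAllSketch(const char *data, int len, int fixed, int maxTries)
{
    int tries = 0;

    while (len > 0 && tries < maxTries) {
        int n = SyslogWriteSketch(data, len, fixed);
        data += n;
        len -= n;
        tries++;
    }
    return tries;
}
```

With the buffer already full, WriteAllSketch(msg, len, 0, 4) makes all four attempts and sends the message to the console four times, while WriteAllSketch(msg, len, 1, 4) finishes after a single call.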
My reasoning is that this would happen anytime a user process wrote to /dev/syslog when the buffer was full (but not for printfs in the kernel, which is why we don't see the problem more often). I'm remaking dev and will include this fix in the new kernels I'm going to build today. I hope to push this stuff out to "new" as quickly as possible since I want to start gathering statistics anyway.

568.
Date: Sat, 14 Oct 89 12:54:50 PDT
From: mgbaker (Mary Gray Baker)
Subject: Something funny with /dev/syslog?

If I execute "cat /dev/syslog", it returns "/dev/syslog: invalid argument". This means no syslog window. Does anyone know of something that changed recently?

569.
Date: Sat, 14 Oct 89 13:03:51 PDT
From: brent (Brent Welch)
Subject: Fsutil_HandleInstall

I finally saw the bug in Fsutil_HandleInstall that has been bothering me for some time. Handle installation is sort of divided into two parts so that memory allocation can be done outside the Handle monitor lock. An external routine does a Fsutil_HandleFetch to see if the handle is already there. If it isn't, it allocates memory and then drops into the HandleInstallInt routine to install the handle under the monitor lock. The bug occurred if the handle appeared in the hash table in between the initial Fetch and the subsequent InstallInt. The InstallInt was clever enough to recheck for the existence of the handle, but it wasn't clever enough to return it! The external routine always assumed that the memory it allocated was the one used for the handle, but that could be wrong. The result was a garbage handle being returned from Fsutil_HandleInstall. I had been suspecting the LRU replacement stuff, but I kept overlooking the obvious bug. Anyway, Oregano crashed during recovery with a garbage handle and this prompted me to look at the code again. I've rebooted Oregano (while pounding on its file systems with process migration) and it works ok.
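The fetch, allocate, install sequence described above can be sketched in C. This is a single-threaded illustration under stated assumptions, not the Sprite code: Handle, table, HandleFetch, InstallInt, HandleInstall and SimulateRacingInstall are hypothetical names, the hash table is reduced to a small array, and the race is simulated with a hook that installs a competing handle between the fetch and the install. It shows the corrected behavior, where the install step returns whichever handle actually ended up in the table and the caller discards its own allocation if it lost the race.

```c
#include <assert.h>
#include <stdlib.h>

#define TABLE_SIZE 16               /* hypothetical; stands in for the hash table */

typedef struct Handle {
    int fileID;
} Handle;

static Handle *table[TABLE_SIZE];

/* Look up a handle without installing anything. */
static Handle *HandleFetch(int fileID)
{
    assert(fileID >= 0 && fileID < TABLE_SIZE);
    return table[fileID];
}

/*
 * Install step; in a kernel this would run under the monitor lock.
 * It rechecks the table and returns the handle that is really there.
 * The bug described above corresponds to rechecking but not returning
 * the existing handle, so the caller kept using its own allocation.
 */
static Handle *InstallInt(int fileID, Handle *fresh)
{
    Handle *existing = table[fileID];

    if (existing != NULL) {
        return existing;            /* someone else installed it first */
    }
    fresh->fileID = fileID;
    table[fileID] = fresh;
    return fresh;
}

/* Simulates another process installing the handle in the window. */
static void SimulateRacingInstall(int fileID)
{
    Handle *other = calloc(1, sizeof *other);

    if (other != NULL) {
        other->fileID = fileID;
        table[fileID] = other;
    }
}

/* External routine: fetch, allocate outside the lock, then install. */
static Handle *HandleInstall(int fileID, void (*raceHook)(int))
{
    Handle *h = HandleFetch(fileID);
    Handle *fresh;

    if (h != NULL) {
        return h;
    }
    fresh = calloc(1, sizeof *fresh);
    if (fresh == NULL) {
        return NULL;
    }
    if (raceHook != NULL) {
        raceHook(fileID);           /* handle appears between Fetch and InstallInt */
    }
    h = InstallInt(fileID, fresh);
    if (h != fresh) {
        free(fresh);                /* lost the race; use the installed handle */
    }
    return h;
}
```

Calling HandleInstall(3, SimulateRacingInstall) returns the handle the hook put in the table, so a later HandleFetch(3) yields the same pointer.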
I'm going to add a little "would-have-crashed" print statement and reboot it again to make sure I'm exercicing the error case. 570. Date: Sat, 14 Oct 89 13:05:39 PDT From: brent (Brent Welch) Subject: Watchdog Reset during migraiton and recovery I started a pmake on sage and then rebooted Oregano. After Sage recovered its handles and started compiling again it suddenly got a Watchdog Reset. I assume that some migration related call didn't quite work right. ps. Thyme was also doing a pmake, but it survived. 571. Date: Sat, 14 Oct 89 13:56:06 PDT From: mgbaker (Mary Gray Baker) Subject: weirdness linting? I've been trying to lint the net module. If I execute lintsun4c in one window, it will try linting it. If I execute it in another window, it says it doesn't know how to lintsun4c. It used to know how a few minutes ago. The environments, etc, appear to be identical in the 2 windows. Could somebody tell me what's happening here? I'm executing all of this on a sun3. 572. Date: Sun, 15 Oct 89 15:41:47 PDT From: mgbaker (Mary Gray Baker) Subject: compiler problem for sun4c net module The compiler is generating signed byte loads instead of unsigned byte loads to access the fields of this structure: /* * Descriptor Ring Pointer (page 21) (Byte swapped. ) * Also, */ typedef struct NetLERingPointer { unsigned short ringAddrLow :16; /* Low order ring address. * Must be quad word aligned. */ unsigned int logRingLength :3; /* log2 of ring length. */ unsigned int :5; /* Reserved */ unsigned int ringAddrHigh :8; /* High order ring address. 
*/ } NetLERingPointer; For instance, in the broken version it generates: 0xf605148c <NetLEReset+240>: ldsb [o0+0x16],o1 0xf6051490 <NetLEReset+244>: and o1,0x1f,o1 0xf6051494 <NetLEReset+248>: or o1,0x80,o1 0xf6051498 <NetLEReset+252>: stb o1,[o0+0x16] 0xf605149c <NetLEReset+256>: add l0,0x4,o1 while in the working version it generates: 0xf6051478 <NetLEReset+220>: ldub [o0+0x12],o1 0xf605147c <NetLEReset+224>: and o1,0x1f,o1 0xf6051480 <NetLEReset+228>: or o1,0x80,o1 0xf6051484 <NetLEReset+232>: stb o1,[o0+0x12] 0xf6051488 <NetLEReset+236>: add l0,0x4,o1 for the source code line 259 in netLE.c: 259 initPtr->recvRing.logRingLength = NET_LE_NUM_RECV_BUFFERS_LOG2; The kernels to compare are sun4c.broken and sun4c.works in my kernel directory. They are identical except that in the working version, the net net module was compiled with the old compiler and assembler. Both were compiled with optimization on in the net module. Didn't we go through this once before when we first switched to gcc and the new assembler? Sometime in mid-July? I have it in my log book as being July 14th. 573. Date: Sun, 15 Oct 89 16:20:10 PDT From: deboor@buddy.Berkeley.EDU (Adam R de Boor) Subject: Re: compiler problem for sun4c net module in the code you sent, it doesn't matter much if it does an unsigned or a signed load, since it immediately ands the result with 0x1f. What is of more concern, I should think, is the four-byte difference in the offset used to access the field, no? 574. Date: Sun, 15 Oct 89 16:58:45 PDT From: tve (Thorsten von Eicken) Subject: gluttony in weird state IP is up, RPC is down loadavgd lists it as being down. rpccmd -ping times out /sprite/cmds/ping answers (!) but with ~300ms delay what's that? I think I had it once before. Is the kernel dead but the user processes still alive? (huh?) 575. 
Date: Mon, 16 Oct 89 11:47:35 PDT
From: root (The Sprite God)
Subject: No add host script

There obviously isn't a script that adds a Sprite to the network because there were a number of details left out regarding Garlic (a.k.a. Mustard). The symbolic link for its swap directory was wrong, and there wasn't an entry for it in /sprite/boot.

576.
Date: Mon, 16 Oct 89 11:48:28 PDT
From: root (The Sprite God)
Subject: network routing

We need to fix network routing for Sprite. When Mustard changed its identity to Garlic we had to rerun netroute on every host so that the ReverseArp done at boot time got the correct SpriteID back.

577.
Date: Mon, 16 Oct 89 11:50:10 PDT
From: root (The Sprite God)
Subject: yp ethers needed for Sprite sun3s

It turns out that an entry in the yp ethers database is needed in order for a Sun3 to find out its Internet Address during bootstrap. Apparently Sprite doesn't properly do RARP. Furthermore, manually adding an arp entry on ginger didn't help. Only after I updated /etc/ethers and did a ypmake was Garlic (a.k.a. Mustard) able to get an Internet address.

578.
Date: Mon, 16 Oct 89 11:51:55 PDT
From: shirriff (Ken Shirriff)
Subject: anise->ginger rcp

When I try to rcp a kernel from anise to ginger, the rcp seems to go into the twilight zone after copying, say, 188416 or 24576 bytes. After that nothing happens. Also, "size" on the sun4 returns exit status 2, causing my pmake to quit unless I do pmake -i.

579.
Date: Mon, 16 Oct 89 11:56:29 PDT
From: root (The Sprite God)
Subject: ds3100 need yp ethers entry, too

It turns out that Sprite DecStations also need an entry in the YP ethers database so they too can ReverseArp and discover their Internet Address. We need to fix Sprite so it can do its own ReverseArp.

580.
Date: Mon, 16 Oct 89 12:30:50 -0700
From: bks@okeeffe.Berkeley.EDU (Brian K. Shiratsuki)
Subject: yp ethers needed for Sprite sun3s

i see.
i purposefully deleted the entries from the sunos tables because i didn't want the sun servers to compete with the sprite server(s).

581.
Date: Tue, 17 Oct 89 10:10:17 PDT
From: brent (Brent Welch)
Subject: bib broken

bib was ported to Sprite some time ago, but it doesn't quite work right. In a short paper with four references it uses the last reference for all of them! The citations are [author88a] [author88b] and so on, and at the end the last citation is repeated four times. The example is in ~brent/doc/wwos.89 . There is a Makefile there.

582.
Date: Tue, 17 Oct 89 11:01:32 PDT
From: Fred Douglis <douglis>
Subject: proc_serverproc needs to be dynamic

background server processes should be handled like rpc_servers -- created when needed, up to a large limit, and reclaimed when not needed. otherwise we run into problems like brent's needing to have a separate recovery process, or kernels getting wedged when all the server processes go to sleep on some condition.

583.
Date: Tue, 17 Oct 89 12:26:10 PDT
From: mgbaker (Mary Gray Baker)
Subject: kgdb.sun4 goes into the debugger

On murder, I was debugging covet with kgdb.sun4. I did a "pid 0xc" command and it seg faulted.
Here is the stack trace: #0 0x44310 in Fs_Read () #1 0x4018a in read () #2 0x98bc in myread (desc=6, addr=(caddr_t) 0x9ca2c20 "", len=3394721) (core.c line 459) #3 0xc94e in psymtab_to_symtab (pst=(struct partial_symtab *) 0xc9334) (dbxread.c line 2739) #4 0x2f1d2 in find_pc_symtab (pc=4127256644) (symtab.c line 1122) #5 0x2c266 in select_frame (frame=(FRAME) 0x172cbc, level=0) (stack.c line 615) #6 0x1ad72 in normal_stop () (infrun.c line 1084) #7 0x1a028 in start_remote () (infrun.c line 414) #8 0x288fa in remote_attach (pid=12) (remote.c line 262) #9 0x1bd22 in pid_command (args=(caddr_t) 0x7dcbc "0xc", from_tty=1) (kgdbcmd.c line 89) #10 0x1cb58 in execute_command (p=(caddr_t) 0x7dcbc "0xc", from_tty=1) (main.c line 481) #11 0x1cc2a in command_loop () (main.c line 507) #12 0x1ca14 in main (argc=2, argv=(caddr_t *) 0x9fdfd04, envp=(caddr_t *) 0x9fdfd10) (main.c line 434) 584. Date: Tue, 17 Oct 89 12:29:09 PDT From: mgbaker (Mary Gray Baker) Subject: deadlock on covet Before the debugger crashed, I got the following info about the deadlock on covet: The deadlock was over the schedMutex lock. The process holding the lock was the "su" program. It was grabbing the lock in Sched_LockAndSwitch(). The current process was the ipServer. It was grabbing the lock in Sched_GatherProcessInfo(). 585. Date: Wed, 18 Oct 89 12:49:03 PDT From: brent (Brent Welch) Subject: Vm_Stat.kernMemPages wrong on ds3100 The kernMemPages value looks more like a byte count as opposed to a page count. On pepper, for example, it is currently 1050532. Actually, the kernel page count on pepper is 1050532 / 4, or 262633. It isn't clear yet what this number is. The kernMemPages field includes a very large hole in the VM address space. The kernel code is loaded at 0x80000000, while the data is loaded at 0xc0000000, and the kernMemPages is wrongly calculated by subtracting the start of the code from the end of the data. I can account for this with the data I've already taken, but I think John H. 
understands how to fix this.

586.
Date: Wed, 18 Oct 89 19:28:21 PDT
From: eklee (Edward K. Lee)
Subject: tx window disappeared

I was running a shellscript on sassafras from forgery when after about 20 minutes or so, I got the following message to my syslog and my window to sassafras died along with whatever I happened to be running.

PdevWrite: signal 14
PdevWrite: signal 14
PdevWrite: signal 14

This is the second time that this has happened to me.

587.
Date: Thu, 19 Oct 89 10:39:48 PDT
From: brent (Brent Welch)
Subject: Allspice crash

Allspice died with a recursive TtyBufferOverflow. It was streaming this message to its console and not responding to any interrupts. I had accidentally used the more program and driven the terminal into a goofy state. I just left it that way because it hasn't always worked for me to power cycle the terminal. Perhaps I should have tried that. Sometime later the crash occurred; I think there were at least several hours in between when I wedged the terminal (I think around noon time) and when Allspice crashed at about 6:25.

588.
Date: Thu, 19 Oct 89 11:30:28 PDT
From: mgbaker (Mary Gray Baker)
Subject: tx in debugger on sparcstations

Tx frequently dies on the sparcstation. If I remember correctly, which I seem to do with decreasing frequency, Mendel reported a tx problem on the sun3 where it died with its pc set to the instruction after a select trap. That's what's happening here.

#0 0x481ec in Fs_RawSelect ()
#1 0x3b5fc in Fs_Dispatch ()
#2 0x2214 in main () (tx.c line 135)
#3 0x3b024 in start ()

0x481e0 <Fs_RawSelect>: sethi %hi(0x0),%g1
0x481e4 <Fs_RawSelect+4>: or %g1, 72, %g1 ! 0x48
0x481e8 <Fs_RawSelect+8>: t 3, %g0 !0x3
0x481ec <Fs_RawSelect+12>: jmpl %o7, 8, %g0 ! 0x8
0x481f0 <Fs_RawSelect+16>: nop

589.
Date: Thu, 19 Oct 89 11:32:04 PDT
From: mgbaker (Mary Gray Baker)
Subject: more on tx

And if I detach the program in the debugger, tx picks up and keeps running fine. I believe Mendel mentioned that funny aspect as well.

590.
Date: Thu, 19 Oct 89 17:45:25 PDT From: brent (Brent Welch) Subject: inetd on mint inetd went infinite on mint. I did a gcore into /sprite/src/daemons/inetd/inetd.core.8200e However, the stack backtrace is simply #0 0xc658 in Sig_SetHoldMask () so perhaps gcore isn't the right thing to figure out infinite loops. If someone knows gdb better they can try to figure things out. 591. Date: Fri, 20 Oct 89 08:49:00 PDT From: brent (Brent Welch) Subject: Allspice network interface reset When I came in Friday morning Allspice was in slow mode. An rpcecho reported timeouts 110 resends 110 acks 11 in 100 attempts! I reset its network interface by hitting break-n on its console, and now it seems fine. 592. Date: Fri, 20 Oct 89 11:53:29 PDT From: pmchen (Peter M. Chen) Subject: fatal error in /sprite/cmds/vi I've been getting these off and on. It goes away the second time I issue the vi, but it is kind of disconcerting. Any ideas about why these have been popping up? Anyone else experiencing these? The error is occuring on the decstation. 593. Date: Fri, 20 Oct 89 12:44:45 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: spur spritemon broken The new spritemon doesn't work on the spur. It dies in XtInitialize. I've replaced it with the old version. 594. Date: Fri, 20 Oct 89 15:00:58 PDT From: brent (Brent Welch) Subject: mint overload on friday Mint died on friday, after being up almost a week. It was struggling along when I went to investigate it, spending most of its time generating TtyInputBufferOverflow messages, along with messages about clients recovering, etc. I'm not sure what triggered the situation, but it eventually got so bogged down printing error messages that it couldn't make forward progress. I eventually got some keystrokes through, enough to sync the disks and hurl it into the debugger. 
The main thing I noticed from the debugger was that several processes were in the ready state, but presumably they weren't scheduled because of the heavy tty traffic. On an up note, when I rebooted mint I got my little print statement indicating that the bug concerning returning garbage handles was successfully tested. Mint would have died during recovery if this hadn't been fixed. On a down note, each client had to recover an average of 3 times before things settled down. 89 recovery attempts were made, and 20585 reopen RPCs were serviced. The last client finished recovery 5 minutes and 40 seconds after mint enabled its RPC service. 595. Date: Fri, 20 Oct 89 15:12:14 PDT From: Fred Douglis <douglis> Subject: Re: mint overload on friday that ttyinputbufferoverflow message is a pain in the neck. when i look at the tty stuff to see about processing at interrupt time, I can also put in a check so this message is printed only once.... 596. Date: Fri, 20 Oct 89 15:52:11 PDT From: Fred Douglis <douglis> Subject: update not setuid the ds3100 version of update, dated 10/3, was not setuid to root. Did someone install this by hand or something? 597. Date: Fri, 20 Oct 89 19:03:34 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: lprm bug If I queue a print job on a ds3100, it won't work (known bug), and then if I try to run lprm on a sun3 to delete the job I get: cfA025hijack.Berkeley.EDU: Permission denied 598. Date: Fri, 20 Oct 89 19:15:59 PDT From: pmchen (Peter M. Chen) Subject: Re: login: must be root to override defaults Yes, I restarted inetd on mustard by hand. This was necessary because on the decstations, you can't kill X and restart it without killing and restarting inetd and ipserver by hand. How should I make sure inetd has a clear environment? Start it up as root? 599. Date: Sun, 22 Oct 89 13:32:00 PDT From: brent (Brent Welch) Subject: Assault crash, out of memory? Assault hung today, after being up 8 days. 
I think it ran out of memory, but I can't be sure because the ds3100.1.032 kernel was carefully removed from all hosts! I think I already complained about this. Perhaps with our huge /sprite/src/kernel partition we won't be so hasty when removing kernel images. For the N'th time, never remove a kernel image if a file server is running it. This is easy to check, and unforgivable. (well, I'll forgive you this time.) Anyway, I've rebooted Assault with JHH.192. Don't even think about removing this kernel. This kernel lets the kernel and the fs cache grow much larger, so Assault shouldn't croak.

600.
Date: Sat, 21 Oct 89 15:50:34 PDT
From: brent (Brent Welch)
Subject: Changing a domain's identity

A weakness in the current prefix table stuff showed up when we moved /sprite/src/kernel to allspice. While we first unmounted /sprite/src/kernel from Oregano and remounted that domain as /sprite/src/kernel.old, the internal domain number didn't change. This meant that clients which had prefix table entries for /sprite/src/kernel with the old token from Oregano were still accessing Oregano. What we need to do is change the internal domain number so the tokens (fileIDs) on the clients become invalid. John H. suggested that at boot time a server could check to see if it's mounting a disk under the same prefix as before. This information is kept in the domain's summary sector on disk.

601.
Date: Mon, 16 Oct 89 14:33:43 PDT
From: Fred Douglis <douglis>
Subject: gethostname problem

gethostname was changed sometime about a month ago to call Proc_GetHostIDs rather than Sys_GetMachineInfo. Unfortunately, it calls it to get the physical host rather than the virtual host, which means "hostname" and anything else that uses it will detect that migration has occurred. Is this a goof or was it intentional? I'm changing it to return the hostname for the home node. the world may need to be relinked.

602.
Date: Mon, 16 Oct 89 17:20:24 PDT
From: pmchen (Peter M.
Chen) Subject: official bug report on gremlin This is the official bug report version of the gremlin problem I mailed to spriters: Ed and I have been trying to use gremlin on the ds3100's and have gotten a lot of weird things happening. 1) When you put down a point, black blobs often come on the screen. 2) The shift and control keys don't do what they're supposed to. Instead, they seem to repeat the last command issued. 3) the help screen is really garbled. Fred and John H. report that they've also run into these problems, which make gremlin extremely painful to use. 603. Date: Mon, 16 Oct 89 18:33:16 PDT From: eklee (Edward K. Lee) Subject: ds3100 crashes with FP exception in kernel I was running Sprite version 1.032 (ds3100). Running ~eklee/simtest/simtest from X causes the kernel to crash with a FP exception. I was able to repeat this three times consecutively. (Could trashing machine registers from user mode cause this to happen?) 604. Date: Mon, 16 Oct 89 20:40:24 PDT From: brent (Brent Welch) Subject: Oregano's network interface Sometime around 6:30pm Monday night Oregano's network interface went out-to-lunch. I came in and noticed a number of error messages and some recovery stuff. When I tried to do things like grep through system code there was essentially no progress until I hit L1-n on Oregano's keyboard to reset its interface. Someone (Mendel?) needs to figure out how to put in a watchdog on this flakey Intel interface. 605. Date: Sun, 22 Oct 89 17:32:54 PDT From: tve (Thorsten von Eicken) Subject: gdb problems on sun4 I can't manage to get to variables. I always get the message 'No symbol "foo" in current context'. Is this known? Am I missing something? I compiled with -g, and no optimization. Something funny though: when the symbol-file is read, I get an error message: Reading symbol data from /mic/X11R3/src/cmds/Xsp/sun4.md/Xsp...done. Type "help" for a list of commands. (gdb) Warning: Unknown symbol-type code `P' at symtab pos 296. 
The same program, compiled for the sun3, loads into gdb without error.

606.
Date: Sun, 22 Oct 89 19:19:02 PDT
From: tve (Thorsten von Eicken)
Subject: mkmf handles file named "version.h" specially

this is NOT said in the manual, as far as I can remember! Thorsten (and I don't think it's a nice idea either)

607.
Date: Sun, 22 Oct 89 19:37:45 PDT
From: tve (Thorsten von Eicken)
Subject: mkmf/pmake doesn't know how to make sun4.md/lex.o from lex.l

on the sun3 and ds3100 everything is fine. on the sun4 i get a "pmake: Can't figure out how to make sun4.md/lex.o. Stop" error. I did many mkmf's, pmake tidy, etc.. no change. weird!

608.
Date: Mon, 23 Oct 89 10:16:35 PDT
From: shirriff (Ken Shirriff)
Subject: tx refresh on ds3100

If I clear a tx window and then select "Set Termcap" from the "Control" window, on the decstation, the window scrolls before the menu disappears, leaving a white rectangle on the normally gray part of the window. This doesn't happen on the sun3.

609.
Date: Mon, 23 Oct 89 18:37:56 PDT
From: tve (Thorsten von Eicken)
Subject: on sun4, pmake of bigcmdtop doesn't always do the final link

It always goes down the subdirs and produces the linked.o, but it won't always do the final link of all the linked.o into the command. The behaviour is not consistently repeatable. It happens with /mic/X11R3/src/cmds/Xsp (the X11R3 server).

610.
Date: Mon, 23 Oct 89 19:25:13 PDT
From: tve (Thorsten von Eicken)
Subject: is the cc man page up-to-date with gcc 1.36?

It doesn't seem so... the comments for -gg are out of date, -fcombine_regs doesn't exist any more, etc...

611.
Date: Tue, 24 Oct 89 10:12:14 PDT
From: brent (Brent Welch)
Subject: mint crash

Mint died last night after /sprite filled up. After it ran out of paper it sort of hung, and then when I added paper I got the good old "TtyInputBufferOverflow" problem. Apparently all the Proc_ServerProc's were stuck on something. It is possible they were hung on recovery with Oregano.
Oregano died for a different reason, a consistency check in the Reopen code that shouldn't have been there. Perhaps we should dedicate a process to tty input? I had to do this for recovery pinging because of similar problems. Historically we used to have several different kernel processes for different tasks, but Mike Nelson gradually changed most things over to use Proc_CallFunc. These are subject to starvation, mainly because they are used to handle page faults, and a crashed server can block page faults, thereby using up the Proc_ServerProcs. In this case, I don't think creating more Proc_ServerProcs is the right solution. Restructuring the page fault code so the retry is done at a higher-level, not using a Proc_ServerProc would be best. 612. Date: Tue, 24 Oct 89 11:03:27 PDT From: Fred Douglis <douglis> Subject: /tmp the remote link for /tmp disappeared sometime recently. i was unable to start up X properly a few minutes ago. anyone know the last time they're sure /tmp was still around? we might be able to focus on a recent reboot (like my own machine, or some other) as a culprit. 613. Date: Tue, 24 Oct 89 14:15:26 PDT From: brent (Brent Welch) Subject: bootp infinite A bootp went infinite on mint. I took a quick look at it was in Fs_RawRead, which is called from recvfrom(), which is called from main line 165. I suspect some bug in the interaction with the retry loop in Fs_RawRead. 614. Date: Tue, 24 Oct 89 17:51:37 PDT From: shirriff (Ken Shirriff) Subject: nm on ds3100 If I do nm ds3100.md/libc.o | grep errno I get V errno The man page says nothing about what "V" means. Anyone know? 615. Date: Tue, 24 Oct 89 18:19:59 PDT From: brent (Brent Welch) Subject: Re: Mx death (bad disk mapping?) Hmm. There shouldn't be any fragmenting going on out that far in the file. Nothing is fragmented beyond 40K, and 0xe000 is at 57K. 0x1e000 is 64K later. This isn't even block aligned. 
I don't think it's RPC fragmenting because that isn't neatly aligned anyway, it crams as much as possible into each packet. It doesn't look like a cache hashing bug because that uses the standard hash function, multiply by a large prime, add 12345, etc. (light bulb goes on) It could be a disk alignment bug, what with our fancy mapping of blocks onto sectors. 64K is about a track size... Hmm, mint has a track size of 23K on its eagle, but blocks do overlap on adjacent tracks by 6K. It is quite possible there is some overlap that I don't expect because the drive is outsmarting me, similar to what we experienced on /mic, although rarer because it's due to sector slipping. What we should do the next time we have one of these botched files is determine what the disk block numbers involved are.

brent

(If that isn't clear, it seems possible that the last block in a cylinder is somehow mapped back onto another block in the same cylinder. I'm not sure exactly. I do know that things packed quite neatly into cylinders on the Eagles:

----------------------------------------------------
|..1.....|..2.....|..3.....|..4.....|..5.....|..6...  track 1
----------------------------------------------------
..|..7.....|..8.....|..9.....|..10....|..11....|..12  track 2
----------------------------------------------------
....|..13....|..14....|..15....|..16....|..17....|..  track 3
----------------------------------------------------
.18...|..19....|..20....|..21....|..22....|..23....|  track 4
----------------------------------------------------

20 tracks in all, this pattern is repeated 5 times per cylinder. If the drive is stealing a block from me due to a bad sector, I don't know what might happen.)

616.
Date: Wed, 25 Oct 89 10:00:12 PDT
From: brent (Brent Welch)
Subject: Warning: receiver framing error on mouse

Either sage's mouse is slowly croaking, or the behavior of the tty-driver needs to be improved when there is a "receiver framing error on mouse".
I can wedge my mouse by rapidly moving it around my screen. I get the error message and the mouse freezes. I then disconnect and reconnect my mouse and continue operation. Can't we reset the serial line (issue a break or something?) in this case? brent 617. Date: Wed, 25 Oct 89 10:49:28 PDT From: Fred Douglis <douglis> Subject: prefix mapping bug this may be the same as something we discussed before, but I'm not sure... % df /c Prefix Server KBytes Used Avail % Used /tmp oregano 300696 240823 29803 88% wasn't getwd supposed to fix this? does df do its own equivalent operation or something? 618. Date: Wed, 25 Oct 89 13:38:17 -0700 From: tve@ernie.Berkeley.EDU (Thorsten Von Eicken) Subject: nfsmounts on oregano very unreliable Right now, msgs doesn't work on gluttony and hangs forever (can't even kill). /eros/octtools is not available and hangs forever. Same yesterday evening. df hangs because oreganos nfs stuff is botched. I don't know where the problem is, but I get the impression I can't rely at all on the nfsmount stuff. Any comment? Shall I just forget about it and consider it as a probabilistic service? -Thorsten Sorry if I sound harsh, I should have waited to calm down before sending this mail... but from home (with a stupid tty) unkillable processes are a real pain (can't just delete the tx window). 619. Date: Wed, 25 Oct 89 16:41:12 PDT From: tve (Thorsten von Eicken) Subject: why isn't the dbm library installed? nor made for the sun4. I need it in X11R3. I'm gonna make the lib for sun3 and sun4 in /sprite/src/lib/dbm. Should it be installed? 620. Date: Wed, 25 Oct 89 16:48:26 PDT From: tve (Thorsten von Eicken) Subject: wrong error message when installing Have a look why this failed: [burble dbm] pmake install --- /sprite/lib/lint.sun4/llib-ldbm.ln --- Installing: /sprite/lib/lint.sun4/llib-ldbm.ln Couldn't create "/sprite/lib/lint.sun4/llib-ldbm.ln": file already exists. 
*** Error code 1 pmake: 1 error [burble dbm] l -d /sprite/lib/lint.sun4 1 drwxrwxr-x 2 mendel wheel 512 Oct 21 13:08 /sprite/lib/lint.sun4/ [burble dbm] l /sprite/lib/lint.sun4 total 107 1 drwxrwxr-x 2 mendel wheel 512 Oct 21 13:08 ./ 2 drwxrwxr-x 44 root sprite 1536 Oct 22 12:17 ../ 60 -rw-rw-r-- 1 rab wheel 55198 Oct 21 13:07 llib-lc.ln 1 -rw-rw-r-- 1 mendel wheel 517 Jul 21 14:31 llib-lcmd.ln 10 -rw-rw-r-- 1 douglis wheel 9853 Sep 27 13:50 llib-lcurses.ln 1 -rw-rw-r-- 1 mendel wheel 525 Aug 11 15:52 llib-ll.ln 4 -rw-rw-r-- 1 rab wheel 3462 Oct 9 12:09 llib-lm.ln 17 -rw-rw-r-- 1 shirriff wheel 17167 Oct 16 17:35 llib-lmx.ln 5 -rw-rw-r-- 1 mendel wheel 4124 Jul 21 14:15 llib-lsx.ln 6 -rw-rw-r-- 1 ouster wheel 5731 Oct 17 08:30 llib-ltcl.ln [burble dbm] Obviously it couldn't create the file because I have no write access to the DIRECTORY. It has nothing to do with the file itself... NB: I'll change the dir to be group sprite. 621. Date: Thu, 26 Oct 89 15:29:33 PDT From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: Xmfb.new crash My ds3100 window system just died. Maybe someone with access to the sources could make a quick check to see if anything obvious is wrong. It crashed while /usr was screwed up so maybe that has something to do with it. Segmentation fault [OpenFont:88 +0x8,0x420698] Source not available (dbx) where > 0 OpenFont(0xffff, 0x0, 0x0, 0x100a7f50, 0x7ddffba4) ["dixfonts.c":88, 0x420698] 1 ProcOpenFont(0x7ddffba4, 0x100a7758, 0x419f4c, 0x1, 0x2) ["dispatch.c":1067, 0x412a80] 2 dispatch.Dispatch(0x0, 0x0, 0x0, 0x0, 0x10009430) ["dispatch.c":316, 0x410f08] 3 main.main(0x0, 0x0, 0x0, 0x0, 0x0) ["main.c":242, 0x402da0] 622. Date: Fri, 27 Oct 89 13:54:26 PDT From: pmchen (Peter M. Chen) Subject: wrong server ID's Warning: Rpc_Dispatch, wrong server ID 25 Client 33 rpc 2 at address: 08:00:20:01:7b:fc Warning: Rpc_Dispatch, wrong server ID 9 Client 33 rpc 2 at address: 08:00:20:01:7b:fc These error messages were received on mustard (a decstation). 
623.
Date: Fri, 27 Oct 89 10:35:54 PDT
From: Fred Douglis <douglis>
Subject: pmake circular dependency bug

If pmake is given a makefile where a target depends on itself, rather than printing something about a circular dependency, it just says "not remade because of errors".

624.
Date: Fri, 27 Oct 89 15:48:43 PDT
From: mgbaker (Mary Gray Baker)
Subject: printer bug

When the laserwriter runs out of paper in the middle of a job, it won't finish the job after you refill it. It prints out a couple more sheets and thinks it's done.

625.
Date: Fri, 27 Oct 89 16:12:07 PDT
From: mgbaker (Mary Gray Baker)
Subject: another printer problem?

My job just got printed again, although I didn't request it. Maybe this has something to do with its having run out of paper before? Maybe it decided to print out another 2 pages and then wait for a while and then print the whole thing again?

626.
Date: Mon, 30 Oct 89 14:54:18 PST
From: tve (Thorsten von Eicken)
Subject: problem with ranlib or ar on sun4

library: /X11R3/src/lib/Xmu

Let's see, Atoms.c declares (globally) a couple of variables and CvtStdSel.c uses them (take, for example, _XA_HOSTNAME). When I compile and link the library on a sun4, programs using this library will not link because of symbol undefined errors (the symbols defined in Atoms.c and used in CvtStdSel.c). When I link the same library on a sun3 for a sun4, everything is perfect.

627.
Date: Mon, 30 Oct 89 12:53:28 PST
From: tve (Thorsten von Eicken)
Subject: /sprite/lib/man/config

I have X11R3 man pages in /mic/X11R3/man and I would like to get them when I type man. If I use the "-c configFile" switch, I have to make a copy of /sprite/lib/man/config and maintain that. Or I would have to edit the config file and add /mic/X11R3/man at the bottom (which some people might not like). Is there another way? Can one specify more than one config file to man?

628.
Date: Mon, 30 Oct 89 10:40:54 PST
From: Fred Douglis <douglis>
Subject: loadavg & recovery

a lot of hosts are listed as being down since sometime in the middle of the night. i think some sort of reopen must have failed. however, i don't see anything in paprika's syslog, for example, to account for the loadavg daemon just going away. if anyone has anything in their syslog pertaining to this (aside from "waiting for recovery" messages) please let me know.

629.
Date: Mon, 30 Oct 89 11:15:24 PST
From: brent (Brent Welch)
Subject: Fs_PageRead recovery failed <1>

Ever had programs die after recovery because of:

10/30/89 11:41:25 mint (32) Fs_PageRead waiting
Fs_PageRead recovery failed <1>
Warning: VmFileServerRead: Error 1 from Fs_Read or Fs_PageRead
MachTrap: Bus error in user proc c2139, PC = e0075d6, addr = 30400 BR Reg 80

It can happen if you are running a program that has been changed recently by removing the image and copying in a new one. While the server is up it doesn't delete the old version of the program because it knows it is being executed. However, after a reboot "the right thing" doesn't happen. Recovery seems to go ok, but later on when you fault on the code segment you get a paging error and your program dies. It seems like the right thing could still happen because the old program images end up in lost+found (I can see the old version of mx there right now, for example, which was the program that died on me.)

630.
Date: Mon, 30 Oct 89 14:00:14 PST
From: ouster (John Ousterhout)
Subject: Time change

Messages in my /dev/syslog are coming out with the wrong hour (daylight savings time, still), whereas my xclock is OK and other programs seem to be OK. Is this a bug in the kernel?

-John-

630.
Date: Mon, 30 Oct 89 18:03:53 PST
From: fubar (Jay Vosburgh)
Subject: Bug: ls man page

The file type specifier 'r' (for remote link, or whatever it's called) in the output of "ls -l" isn't documented in the man page...

631.
Date: Tue, 31 Oct 89 10:22:37 PST
From: rbk (Bob Beck)
Subject: Need device driver interface document

For sprite drivers. This would be a big help in porting Sprite to other machines, where drivers exist but have (eg) BSD or SysV kernel interfaces.

632.
Date: Tue, 31 Oct 89 10:33:46 PST
From: rbk (Bob Beck)
Subject: Need "md" module interface definitions document for Sprite

This should help in porting Sprite to new machines, by avoiding ambiguity in which procedures there are and what they actually must do. In the absence of such documentation, you have to look at existing modules and determine what the true needs are, filtering out machine dependencies.

On talking with John, it would seem a list of relevant procedures and maybe a 1-liner about the procedure is sufficient, if the procedure documentation (header) specifies the interface well enough.

633.
Date: Tue, 31 Oct 89 11:18:16 PST
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: user/kernel time wrong

The user and kernel time statistics on a multiprocessor are wrong. The trap handlers have to be changed to mark the current processor as being in kernel mode. Right now this only happens on interrupts, where it looks to see what mode it was in before the interrupt. This works fine on a uniprocessor but not on a multiprocessor.

634.
Date: Tue, 31 Oct 89 11:48:18 PST
From: rbk (Bob Beck)
Subject: MASTER_UNLOCK doesn't do test-and-set

This can be a problem if the cache architecture of the machine doesn't support an "ownership" protocol -- eg, on Sequent Symmetry, if the cache is running "write-thru", both the "acquire lock" and "release lock" must do test-and-set (just doing a write on the mutex variable can race with an attempt to acquire the lock). However, the current code "(semaphore)->value = 0;" does work on Symmetry running copy-back cache mode.
Just thought you would be interested -- the MASTER_UNLOCK() implementation isn't truly machine independent, although it's defined in /sprite/src/kernel/sync/sync.h

635.
Date: Tue, 31 Oct 89 14:48:53 PST
From: brent (Brent Welch)
Subject: RPC binding hosed

Assault went into the same state that Mint and Allspice got into Monday morning. You could talk with Assault from some machines, but not others. I poked around and noticed (via rpcstat -sinfo) that many RPC requests were being dropped on the floor (the "noalloc" field). I didn't drop Assault into the debugger (kdbx fear), but when I rebooted it there was one hung kernel process, Rpc_Daemon. This is the guy that's in charge of creating new server processes and for closing up connections on idle channels. With this process hung only requests over currently existing channels were accepted, and no dynamic re-binding of server processes happens. I'll go look at the code and see if I can figure out why Rpc_Daemon hung itself.

636.
Date: Tue, 31 Oct 89 14:50:09 PST
From: rbk (Bob Beck)
Subject: sync/syncLock.c has assumptions about memory system ordering of reads and writes

Sync_SlowLock() (and others, I suspect) seem to rely on the memory system doing things in "right" order -- ie, Sync_GetLock() may call Sync_SlowLock() which will try to T&S the inUse variable, then set waiting=TRUE, then try the T&S again... However, Sync_Unlock() just writes inUse=0 *then* tests waiting...

Although I think this works on Symmetry, it's not clean and relies on strict order of reads and writes to processor cache; ie, if Sync_Unlock()'s read of waiting passed its write of inUse, this code would race and fail.

I would prefer to see this with explicit locking of the "Lock" variable that avoids these problems -- ie, state manipulation of the Lock variable while holding a mutex inside the variable (Sequent's kernel mutex abstractions all behave this way).
This (I think) is much more clear, and most/all MP systems will provide guarantees on cache/memory writes being done when a T&S completes (eg, to unlock the data-structure). I think some of the higher performing RISC parts due out in a year or so may violate the assumptions you're making here.

On a further note, sufficiently highly optimizing compilers might take it upon themselves to re-order some of these statements. Volatile declarations may help, but may be too strong. Some people (eg, Sequent) are making the compilers sensitive to various procedures (eg, v_lock()) to know this is a mutual exclusion point, code cannot be moved across this boundary, and the HW ensures previous writes are flushed when a T&S write completes.

This dependency should be documented, if not resolved otherwise.

637.
Date: Wed, 01 Nov 89 00:57:21 PST
From: Fred Douglis <douglis>
Subject: prefix bugs

i wanted to make a simple change: make a ds3100 export /tmp. when i deleted /tmp from oregano's prefix table, though, it stopped dealing with other prefixes (i'd get "/c unreadable" even if i deleted it and rebroadcasted). i had to remove /tmp from /t1/hosts/oregano/mount and reboot oregano. then everything was okay, except that hosts with entries for /tmp were able to keep accessing /c/tmp even though oregano wasn't exporting. i could then explicitly delete /tmp and force a rebroadcast and that worked.

638.
Date: Tue, 31 Oct 89 18:02:07 PST
From: Fred Douglis <douglis>
Subject: emacs & ipServer on ds3100

seems to be a ds3100 bug where killing the X server without killing an emacs client will leave the ipServer in an infinite loop. be advised in the meantime that exiting emacs explicitly is probably a Good Thing.

639.
Date: Wed, 1 Nov 89 01:15:48 PST
From: douglis (Fred Douglis)
Subject: bug with permissions caching

I am using /dist/dist/sprite/cmds.ds3100 as /sprite/cmds.ds3100 for my benchmarking.
it contained no setuid files, so I found all the setuid files in the old cmds.ds3100 and made the new ones setuid. nevertheless, i couldn't run rlogin, even when i confirmed it was setuid root. however, copying the same file using update -O produced a file i could execute okay. looks like maybe sprite remembers the protection somehow??

640.
Date: Wed, 1 Nov 89 08:30:54 PST
From: brent (Brent Welch)
Subject: decstation fonts

Once again my spritemon is messed up because of some quirk in the decstation fonts. I switched over to Xmfb.new so I know that caused it. However, I'm frustrated because the font stuff is black magic, and I hate that. I'd really like a 'fonts' man page so I can figure things out myself instead of having to whine to the bugs mailing list. Can someone start a font man page?

brent

p.s. I know I've complained about this before, and I've probably gotten a good answer. However, this breaks at such long intervals that I've forgotten the magic incantation. We need a man page.

641.
Date: Wed, 1 Nov 89 15:19:16 PST
From: brent (Brent Welch)
Subject: Mint is ailing

Well, we've been having some troubles with Mint, haven't we? It seems to get into states of overload and begins to misbehave. I want to fully understand things before I go hacking away, however.

First, as a user, don't hesitate to send me mail if the system craps out on you and you resort to rebooting your client. Ideally you shouldn't have to do that, and I'd like to know about it if it happens.

In the meantime I'm going to augment Mint's kernel with some hooks so I can get at its recovery-related state. It seems to get into modes where it thinks all the clients have rebooted, so it yanks the rug out from under them. This triggers recovery actions by clients, which then overloads mint. I know how to tune the client side so that recovery loads mint less, but first I want to understand why mint freaks out in the first place.

642.
Date: Wed, 1 Nov 89 18:13:01 PST
From: brent (Brent Welch)
Subject: Re: Something bad about caching?

Fenugreek is importing /sprite/lib/ from assault. I ran 'stat' on the files because I suspected something like this:

<sage 892> stat /sprite/lib/include/sysStats.h
--rw-r--r-- 1 ID=(1471,155) 8310 bytes /sprite/lib/include/sysStats.h
Server Domain File # 32 1 90339
Version 62 UserType 0x0
Created:         Nov 1 15:54:20 1989
Data modified:   Nov 1 16:55:31 1989
Descr. modified: Nov 1 16:55:31 1989
Last accessed:   Nov 1 17:06:44 1989

<fenugreek 2> stat /sprite/lib/include/sysStats.h
--r--r--r-- 1 ID=(0,0) 8172 bytes /sprite/lib/include/sysStats.h
Server Domain File # 25 2 5612
Version 3 UserType 0x0
Created:         Oct 26 20:51:26 1989
Data modified:   Oct 10 16:27:30 1989
Descr. modified: Oct 26 20:51:26 1989
Last accessed:   Nov 1 17:00:40 1989

So this is a prefix bug, not a caching bug. It seems straightforward to fix the prefix bug. I can mark exported prefix handles specially on the server, and verify this on naming operations. This would ensure that naming operations are denied when the server stops exporting a prefix.

643.
Date: Thu, 2 Nov 89 13:05:31 PST
From: brent (Brent Welch)
Subject: Re: bad decStation kernel

From what I saw last night, the new ds3100 kernel was dying in Mach_TestAndSet with a "bad address on load". Any recent changes to the mach module?

644.
Date: Thu, 2 Nov 89 13:29:03 PST
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: watchdog reset

Thyme just suffered a watchdog reset running kernel SPRITE VERSION BW.183. According to Brent this is very similar to the installed kernel. I wasn't able to get anything from the prom -- it looked like the pc and sp had been reset.

645.
Date: Thu, 2 Nov 89 13:56:13 PST
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: loadavg wrong for MP

Loadavg doesn't deal with more than one processor and will compute cpu utilization wrong on a multi-processor.
I don't have time to fix it right now so this message serves as a reminder to do it later.

646.
Date: Fri, 3 Nov 89 08:21:05 PST
From: rbk (Bob Beck)
Subject: Misc header file glitches

John had asked me to notice procedure headers that seem "weak" or otherwise questionable -- ie, not sufficient specification of the procedure... I didn't find many (yet ;-), but thought I'd pass these along...

/sprite/src/kernel/vm/spur.md/vmSpur.c
    VmMach_BootInit()
    Semantics and interface not well specified

/sprite/src/kernel/sync/syncLock.c
    Sync_GetLock()
    Semantics not specified other than "this is kernel version".

/sprite/src/kernel/rpc/rpcCall.c
    comment at top talks about lust:~brent/src/sun/sys/h/rfs.h -- is this still valid? ~brent/src/sun/sys/h/rfs.h doesn't exist on the Sprite network.

Sig_Send() has a comment: "When we go to a multi-processor this routine must be rewritten to possibly interrupt a running process". Is this comment still valid? It looks like Sync_WakeWaitingProcess() handles waking the other processor...

647.
Date: Fri, 3 Nov 89 11:28:09 PST
From: shirriff (Ken Shirriff)
Subject: sed bug

ls | sed -e "/e/x\
/e/p"

causes a segmentation violation in sed.

648.
Date: Fri, 03 Nov 89 15:40:56 PST
From: Fred Douglis <douglis>
Subject: tftpd dregs

mint has about a half-dozen tftpd processes lying around. I don't know which ones to kill, or why they're not dying. I thought this bug had been fixed a while ago.

649.
Date: Fri, 03 Nov 89 15:55:05 PST
From: Fred Douglis <douglis>
Subject: eviction/loadavg bug

i noticed that sage was listed as being down; debugging it showed it was in the middle of an eviction request. looks like it's possible for an eviction to get lost, or in any case there may be some race condition. next time someone notices loadavg getting wedged, please let me know so i can debug the kernel to see where the process is and the internal kernel state relating to eviction.

650.
Date: Fri, 3 Nov 89 16:25:45 PST
From: jhh@sprite.Berkeley.EDU (John H.
Hartman)
Subject: gethostent

The routine gethostent() appears in the gethostbyname man page, but does not exist in the C library.

651.
Date: Fri, 03 Nov 89 17:47:05 PST
From: Fred Douglis <douglis>
Subject: ds3100 ar/ranlib status?

i'm really getting fed up with ar appending new copies into libraries instead of replacing the old ones. I was fed up enough to try to build a new ar and fix the problem. The catch is, it was just recently recompiled, and the new one worked fine when given the same command (ar r ....). I looked in the sprite log and it seems the problem is really with ranlib: the sprite ranlib wouldn't compile (and still won't), and the ultrix ranlib wouldn't work with our ar.

as a temporary fix, i am going to install sprite's ar as ar.sprite, and change biglib.mk to invoke ar.sprite instead of ar for decstations.

as a more permanent fix, we need to fix ranlib. I took a look at it and don't think it looks good -- the a.out hdr formats and constants and macros are all too different. I started trying to convert it but found that I'm missing the "symbol table offset" that's there for the suns. maybe bob knows more about this stuff and can take a look sometime?

652.
Date: Sat, 4 Nov 89 16:02:20 PST
From: tve (Thorsten von Eicken)
Subject: ranlib on sun4 very flaky

I mentioned this before... try:

cd /sprite/src/lib/dbm; pmake clean; pmake    # I did "pmake installdebug" but
                                              # I suppose "pmake" will do it too.

... and watch the ranlib go into debug state...

653.
Date: Sat, 4 Nov 89 16:44:35 PST
From: tve (Thorsten von Eicken)
Subject: sed problem on sun4s

The following, found in /sprite/lib/mkmf/mkmf.top, doesn't work on sun4's because sed doesn't output the last line of input if it isn't terminated by a newline. I.e. after the "tr" command above, the input to sed is a single line without terminating newline. Sed on the sun4 will not output anything at all.

654.
Date: Sat, 4 Nov 89 16:59:36 PST
From: tve (Thorsten von Eicken)
Subject: are process ids guaranteed to be unique in the network?

Or is there just a "high probability" that they are unique? Doing pmakes on sun3's I often get (compiling for sun4):

--- sun4.md/XCopyArea.o ---
rm -f sun4.md/XCopyArea.o
cc -DERRORDB=\"/X11R3/src/lib/X11/XErrorDB\" -DTCPCONN -DFONT_SNF -DFONT_BDF -DCOMPRESSED_FONTS -DSPRITE -Usprite -Uunix -Uultrix -DINCLUDE_ALLOCA_H -I/X11R3/lib/include -O -msun4 -Dsprite -Dsun4 -I. -Isun4.md -I/X11R3/lib/include -I/X11R3/lib/include/X11 -traditional -fwritable-strings -finline-functions -fstrength-reduce -c XCopyArea.c -o sun4.md/XCopyArea.o
/sprite/cmds.sun3/cpp: /tmp/cc727631.cpp: invalid argument
*** Error code 1
pmake: 1 error
*** Error code 1
pmake: 1 error

and when I restart pmake everything is fine. Dunno what's going on!

655.
Date: Sat, 04 Nov 89 20:44:56 PST
From: Fred Douglis <douglis>
Subject: dumps didn't complete

I saw that the dumps hadn't run last night, and that Bob apparently wasn't around, so I tried running them. I ran "dailydump" on murder, and it seemed to do /user1 and /user2 just fine but then died on /sprite with

-: I/O error

656.
Date: Sun, 5 Nov 89 00:28:04 PST
From: tve (Thorsten von Eicken)
Subject: /sprite/lib/sun4.md/libc.a:socket.o:_Stat_PrintMsg

This symbol is undefined. I can't link any of my X stuff!! please fix quick!

To test: cd /X11R3/src/cmds/Xsp; pmake TM=sun4

Sample:

--- sun4.md/Xcfb ---
rm -f sun4.md/Xcfb
cc -g -O -msun4 -Dsprite -Dsun4 -o sun4.md/Xcfb ddx/snf/sun4.md/linked.o ddx/mi/sun4.md/linked.o ddx/mfb/sun4.md/linked.o ddx/cfb/sun4.md/linked.o ddx/sprite/sun4.md/linked.o dix/sun4.md/linked.o os/sprite/sun4.md/linked.o -ldbm -lm
socket.o: Undefined symbol _Stat_PrintMsg referenced from text segment

657.
Date: Sun, 5 Nov 89 01:08:45 PST
From: tve (Thorsten von Eicken)
Subject: no gcore for sun4's

Could someone please compile/make one?

658.
Date: Sun, 5 Nov 89 01:08:19 PST
From: tve (Thorsten von Eicken)
Subject: ipServer on crackle (sun4) in debug state

Sorry, no core dump -> gcore doesn't exist
Sorry, no backtrace -> /sprite/src/daemons/ipServer/sun4.md is empty
Sorry, no backtrace -> /sprite/daemons.sun4/ipServer has no symbol table

... good job!

659.
Date: Mon, 06 Nov 89 11:26:58 PST
From: Fred Douglis <douglis>
Subject: sun4/sun4c (emacs) incompatibility

it seems that the same dumped version of emacs can't run on both vanilla sun4s and sun4c's. Since the predominant type of sun4s is, or will be, sun4c, I'm going to make the default version of emacs be the sun4c flavor. I'll move the other one to /emacs/cmds.sun4/emacs.sun4 (emacs.sun4c and emacs will be the same).

As an alternative, I could remake emacs for the sun4 with CANNOT_DUMP defined, so it might start up okay on both types but would take "forever" to get going. Let me know if you have a strong preference.

I am cc'ing bugs on this because it suggests we may have to consider methods for distinguishing between sun4s and sun4c's at user level (in %MACHINE).

660.
Date: Mon, 6 Nov 89 12:32:23 PST
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: mx bug/feature

If I run mx on multiple files (mx *.c) the first time I use the "next" command to get to the next file it is a no-op. The second usage gets me to the second file, after which all uses work properly.

661.
Date: Mon, 13 Nov 89 10:03:32 PST
From: Fred Douglis <douglis>
Subject: setjmp

I checked, and the ds3100 is the only one that doesn't have _setjmp.o. It has setjmp.o. The ultrix libc.a has both. Any idea whether we used to have _setjmp.o? The real question is, can we restore it from tape, or do we grab the ultrix .o file, or what?

662.
Date: Mon, 6 Nov 89 14:27:02 PST
From: tve (Thorsten von Eicken)
Subject: lps40 access for new machines

crackle has no access to the lps40. I guess there is the same problem with burble, buzz, treason

663.
Date: Mon, 6 Nov 89 22:16:52 PST
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: Rpc_ChanFree: freeing free channel

Lust just crashed trying to free an already free channel. The structure looks ok to me, but the state is 0. I don't see any way in which this could have happened, but it did. Has anyone changed anything in the rpc module that could have caused this? I have a copy of the stack backtrace if it is helpful.

664.
Date: Mon, 6 Nov 89 23:41:51 PST
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: oregano crash

I tried to use the IOC_SCSI_COMMAND ioctl to get the size of a disk attached to oregano and oregano died. When the scsi command completed RequestDone in sun3.md/devSCSI3.c called scsiDoneProc and passed it a senseDataPtr of 0. scsiDoneProc then died with a bus error. I know this ioctl works on a sun4 using a Jaguar, but the code doesn't look too different so I can't figure it out. All of the data structures looked ok, so I think there is just a goof in the flow of control when using this ioctl.

665.
Date: Tue, 7 Nov 89 09:01:35 PST
From: mendel (Mendel Rosenblum)
Subject: Re: ranlib on sun4 very flaky

> I mentioned this before... try:
> cd /sprite/src/lib/dbm; pmake clean; pmake # I did "pmake installdebug" but
> # I suppose "pmake" will do it too.
> ... and watch the ranlib go into debug state...
> Thorsten

Actually, "pmake" works. "pmake installdebug" didn't until I reinstalled ranlib. There have been many cases of stale object files in /sprite/cmds.sun4. I think we should reinstall all the sun4 commands.

666.
Date: Tue, 7 Nov 89 09:42:05 PST
From: tve (Thorsten von Eicken)
Subject: tftpd on crackle ?!

I just rebooted, and am getting "inetd[...]: /sprite/daemons/tftpd: exit status 0x100" messages every minute or so (on the console).

667.
Date: Tue, 7 Nov 89 09:48:41 PST
From: mendel (Mendel Rosenblum)
Subject: Re: tftpd on crackle ?!

Some host on the net sends tftp requests to the broadcast address when trying to boot.
Since the tftpd daemon was never installed on the sun4s but was listed in the inet.conf file, inetd would try to exec /sprite/daemons/tftpd when each request came in. I've installed tftpd for the sun4 so you should not see the message anymore.

668.
Date: Tue, 07 Nov 89 13:55:45 PST
From: Fred Douglis <douglis>
Subject: Re: Down machines

I noticed that. looks like some sort of migration bug, in that the "eviction request" may not have returned properly. why either host thought it had a foreign process is beyond me. however, loadavg.new lists mint as up, and that has the timeout i mentioned in the meeting, so i'm pretty sure that's the case. (i also noticed mint listed as "hasmig" shortly after it rebooted, and was planning to look into that at some point. difficult, though, when it's the file server.)

669.
Date: Tue, 7 Nov 89 14:17:13 PST
From: tve (Thorsten von Eicken)
Subject: what version of gcc on sun4's???

I thought we had moved to gcc 1.36 a while ago? But look:

cc -v -S goo.c
gcc version 1.36
target machine is sun4
/sprite/cmds.sun4/cpp -v -msun4 -undef -D__GNUC__ -Dsparc -Dsun4 -Dunix -Dsprite -D__SOFT_FLOAT__ goo.c /tmp/cc669451.cpp
GNU CPP version 1.34
/sprite/cmds.sun4/cc1.sparc /tmp/cc669451.cpp -quiet -dumpbase goo.c -version -o goo.s
GNU C version 1.34 (sparc) compiled by GNU C version 1.34.

670.
Date: Tue, 7 Nov 89 14:31:20 PST
From: tve (Thorsten von Eicken)
Subject: gcc floating point confusion on sun3's

What is going on? When is the 68881 used and when not? When is __SOFT_FLOAT__ defined and when not? When do cpp, cc1 and as agree?
[sassafras foo] cc -O -o goo68 goo.c -v -m68881
gcc version 1.36
target machine is sun3
/sprite/cmds.sun3/cpp -v -msun3 -undef -D__GNUC__ -Dmc68000 -Dsun3 -Dunix -Dsprite -D__OPTIMIZE__ goo.c /tmp/cc728371.cpp
GNU CPP version 1.36
/sprite/cmds.sun3/cc1.68k -msoft-float -m68020 /tmp/cc728371.cpp -quiet -dumpbase goo.c -m68881 -O -version -o /tmp/cc728371.s
GNU C version 1.36 (68k, MIT syntax) compiled by GNU C version 1.36.
default target switches: -m68020 -mc68020 -m68881 -mbitfield -msun3
/sprite/cmds.sun3/as -m68020 /tmp/cc728371.s -o goo.o
/sprite/cmds.sun3/ld -X -e start -o goo68 -L/sprite/lib/sun3.md goo.o -lc

[sassafras foo] cc -O -o goo68 goo.c -v -msoft-float
gcc version 1.36
target machine is sun3
/sprite/cmds.sun3/cpp -v -msun3 -undef -D__GNUC__ -Dmc68000 -Dsun3 -Dunix -Dsprite -D__SOFT_FLOAT__ -D__OPTIMIZE__ goo.c /tmp/cc531771.cpp
GNU CPP version 1.36
/sprite/cmds.sun3/cc1.68k -msoft-float -m68020 /tmp/cc531771.cpp -quiet -dumpbase goo.c -msoft-float -O -version -o /tmp/cc531771.s
GNU C version 1.36 (68k, MIT syntax) compiled by GNU C version 1.36.
default target switches: -m68020 -mc68020 -m68881 -mbitfield -msun3
/sprite/cmds.sun3/as -m68020 /tmp/cc531771.s -o goo.o
/sprite/cmds.sun3/ld -X -e start -o goo68 -L/sprite/lib/sun3.md goo.o -lc

-------

I.e: it seems __SOFT_FLOAT__ is always defined, -msoft-float is default (yuck!)

Thorsten (and Andreas who pointed me at this)

671.
Date: Tue, 7 Nov 89 14:32:50 PST
From: tve (Thorsten von Eicken)
Subject: the csh notion of process time on the sun4 is broken.

I guess it just needs to be recompiled? Witness:

[crackle foo] time ./goo
i=100000 b=(INFINITY)
0.0u 0.0s 0:43 0% 0+0io 0pf+0sw 0k
[lots of zeros here!]

672.
Date: Tue, 7 Nov 89 16:50:12 PST
From: shirriff (Ken Shirriff)
Subject: Transient cc bug.

While compiling fsCacheConsist.c for the ds3100, I got:

ugen: internal L line 767 : build.p, line 1743 unexpected u-code.

I tried it again and I didn't get this.

673.
Date: Tue, 7 Nov 89 17:46:34 PST
From: eklee (Edward K. Lee)
Subject: pmake profile does not work for libraries

674.
Date: Tue, 7 Nov 89 17:56:32 PST
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: thyme crash, CallFunc: Process queue full

Thyme crashed when the Proc_ServerProc queue filled up. All of the server procs were ready, and one was running. The queue was full of calls to TransferInProc. It looks like something (interrupt handler?) was stuffing calls to this procedure into the queue faster than the server procs could handle them. Here is the backtrace of the procedure that discovered the full queue:

#0 panic (_va_args=235192782) (sysPrintf.c line 209)
#1 0xe04c324 in CallFunc (funcInfoPtr=(FuncInfo *) 0xe80ff38) (procServer.c line 544)
#2 0xe04bc70 in Proc_CallFunc (func=(void (*)()) 0xe014044, clientData=(ClientData) 0xe07e5c4, interval=0) (procServer.c line 174)
#3 0xe01403a in DevTtyInputChar (ttyPtr=(struct DevTty *) 0xe07e5c4, value=56) (devTty.c line 536)
#4 0xe00995a in DevConsoleInputProc (ttyPtr=(struct DevTty *) 0xe07e5c4, value=56) (sun3.md/devConsole.c line 328)
#5 0xe014090 in TransferInProc (ttyPtr=(struct DevTty *) 0xe07e5c4, callInfoPtr=(Proc_CallInfo *) 0xe80ffd8) (devTty.c line 577)
#6 0xe04c04c in Proc_ServerProc () (procServer.c line 376)
#7 0xe056048 in Sched_StartKernProc (func=(void (*)()) 0xe04be58) (schedule.c line 944)
(gdb)

Thyme aborted out of the debugger, ignored the watchdog reset button, and suffered watchdog resets in the prom, so perhaps this is a hardware problem.

675.
Date: Tue, 7 Nov 89 18:08:12 PST
From: brent (Brent Welch)
Subject: Blocking Fs_PageRead clogs the system

The VM system uses the Proc_ServerProcs to fill pages during a page fault. The problem is that Fs_PageRead blocks during recovery, and this can use up all the Proc_ServerProcs. Both sloth and thyme died because the Proc_CallFunc queue filled up.
It couldn't be serviced because all the Proc_ServerProcs were blocked on recovery inside Fs_PageRead. The fix has to be inside the VM system. It has to figure out what to do if Fs_PageRead returns EWOULDBLOCK (or something) so that Fs_PageRead doesn't block. I know the VM system already does some recovery waits because it uses the handle of the swap directory for this.

676.
Date: Tue, 7 Nov 89 18:30:00 PST
From: mgbaker (Mary Gray Baker)
Subject: sun4 debug crash

There's a bug in the sun4's that causes a cache write-back error when you try to debug a user process. This is new and very bad. I'm investigating now.

677.
Date: Tue, 7 Nov 89 18:55:53 PST
From: brent (Brent Welch)
Subject: pmake installhdrs in mach

When I do pmake installhdrs in the mach module it claims there are no sources. It doesn't attempt to go into the .md directories.

678.
Date: Tue, 07 Nov 89 22:18:27 PST
From: Fred Douglis <douglis>
Subject: sun4 library

i saw that the new finger (with the new loadavg database file) was successfully installed the other day for all types but sun4, so I tried to compile it again. It wouldn't link because the installed sun4 libc.a was incomplete. When I tried to recompile, I had to rerun mkmf because lib/c/Makefile was set up only for "sun3" even though it looked like it hadn't been regenerated since last month sometime. Did someone edit it by hand?

679.
Date: Wed, 8 Nov 89 23:43:12 PST
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: more on process queue bug

It seems fairly repeatable if you exit the console window on a sun3 such that your X window system gets torn down. You'll get a prompt back in the console window, but the first time you press a key the process queue overflows with calls to TransferInProc. I tried this twice on thyme running version 1.038. I looked at the stack but can't figure out who's putting all the calls in the queue. TransferInProc looks like it puts itself on the queue, but that shouldn't cause it to overflow.

680.
Date: Thu, 9 Nov 89 12:08:44 PST
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: 1.039 flakey

There is an unknown bug in 1.039 that trashes your stack. Hijack died twice with a messed up stack. All I was doing at the time was editing files. Thyme suffered a watchdog reset when I started a pmake. Mint recovered for some unknown reason, and instantaneously thyme reset. The only stable machine this morning has been the spur, but I can't get to it because my other workstations insist on dying.

681.
Date: Thu, 9 Nov 89 11:58:58 PST
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Subject: spritemon

When I try to use spritemon to display the cpu utilization of 5 processors it screws up and only does 4, the fifth "pane" always being blank. Right now all 5 processors are pegged, but spritemon shows the 5th as having 0 utilization.

682.
Date: Thu, 9 Nov 89 12:31:07 PST
From: tve (Thorsten von Eicken)
Subject: tftp/udp server failing (looping) on crackle (sun4)

My syslog just showed the following. Is it relevant?

<27>Nov 9 12:29:59 inetd[6370c]: tftp/udp server failing (looping), service terminated

683.
Date: Thu, 9 Nov 89 14:27:40 PST
From: mgbaker (Mary Gray Baker)
Subject: Something funny with recovery?

An ls to allspice hung on me. I wasn't getting recovery even quite a while after murder did, so I killed the ls and re-executed it. Then I got recovery and the ls succeeded.

684.
Date: Thu, 09 Nov 89 14:09:44 PST
From: rab (Robert A. Bruce)
Subject: blob from hell lives!

I have a blob from hell in one of my tx windows. Clearing the screen does not kill it, nor does selecting something in another window. I am running the default tx, compiled on Oct 17.

685.
Date: Thu, 09 Nov 89 14:28:56 PST
From: Fred Douglis <douglis>
Subject: Re: Something funny with recovery?

i thought there was a process that pinged and tried to recover, but i had the same problem -- paprika said waiting for recovery but didn't recover until i tried something new that made it talk to mint.

686.
Date: Thu, 9 Nov 89 14:31:36 PST From: brent (Brent Welch) Subject: 2 O'Clock Glitch Did your machine go through recovery at 2 this afternoon, or perhaps at 11 this morning? These glitches correspond exactly with the times all the hosts are sampling their kernel statistics by running a little script. The global crontab is set up to do this at 8am, 11, 2, 5, and 8pm. The file servers take a sample every hour. Anyway, this overloads Mint enough to cause glitches. My machine got timeout when writing back both its migInfo.new and migInfo files. It also had to try recovery twice. This is an interesting comment on scalability. In the meantime I'm going to add a pseudo-random sleep to the script that gets run by the crontab. 687. Date: Thu, 9 Nov 89 17:46:16 PST From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: vfork behaves differently Our implementation is somehow different from the bsd implementation. I have a program that runs under unix, but not under sprite. When I changed the vfork to fork it works fine. I think the semantics of vfork state that the parent cannot run while the child is using its resources. This implies that the parent cannot run until the child exec's, and I have a hunch that isn't happening. 688. Date: Thu, 9 Nov 89 23:50:08 PST From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: xkill If I try to 'xkill' an iconified window my uwm exits. hijack<jhh 3> XIO: I/O error [2] Exit 1 uwm There is no man page for xkill either. 689. Date: Thu, 9 Nov 89 23:51:23 PST From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: more xkill info I realized my last message was kind of lacking on details. The windows I want to kill, and the uwm that dies are on hijack running 1.034. I usually run xkill from a sun3, or in this case a sun4. xkill on a ds3100 doesn't do anything. 690. Date: Thu, 09 Nov 89 23:55:30 PST From: Fred Douglis <douglis> Subject: Re: more xkill info i believe the xkill/uwm problem exists under all configurations. 
uwm must "own" the icon when xkill goes to kill the client, or something. or uwm just doesn't handle the condition it hits. i don't think it's an xkill bug. xkill on a ds3100 usually works for me, though sometimes it's actually killed my X server in the process. 691. Date: Fri, 10 Nov 89 10:59:56 PST From: culler (David Culler) Subject: To print or not to print I can send files to lw533 from remote unix hosts (e.g. fennel), but not remote sprite hosts (e.g. cardamom). lpq says: waiting for queue to be enabled on shallot 692. Date: Fri, 10 Nov 89 15:29:51 PST From: ouster (John Ousterhout) Subject: Re: To print or not to print The problem is that at present every individual workstation has to be entered in a particular printer table somewhere. Bob, is there a way to set up lpd in a fashion similar to sendmail, so that all print requests coming from any sprite machine are considered to come from "sprite.berkeley.edu", so that only a single entry has to be made in the printer table to accommodate all Sprite hosts? 693. Date: Fri, 10 Nov 89 15:39:04 PST From: Fred Douglis <douglis> Subject: Re: To print or not to print Are you sure that's the problem? Seems to me there's a difference between unauthorized access (printing to the lps40, for example) and a spooling problem that claims a queue is disabled. I think it's the sprite printing software that's confused. I've seen this happen on paprika even with machines that could normally print. 694. Date: Sun, 12 Nov 89 15:07:43 PST From: pmchen (Peter M. Chen) Subject: mustard crash--FPU interrupt in kernel mode Fatal error: FPU Interrupt in Kernel mode Entering debugger with a Breakpoint trap exception at PC 0x800b5550 I was running gremlin and ggraph and some other stuff. Mustard is a decstation. I am rebooting. 695. Date: Sun, 12 Nov 89 15:18:51 PST From: pmchen (Peter M. Chen) Subject: crash is repeatable Same crash (FPU interrupt in Kernel Mode), same error message (same PC).
I was running the "new" kernel, which is 1.039, I think. To duplicate the problem: cd ~pmchen/simul/out/su_size_sy2 simgg norm100k (You have to have my alias for simgg). Simgg runs several things, among them a nawk script and ggraph. I'm going to go back to the "sprite" kernel. 696. Date: Sun, 12 Nov 89 15:38:33 PST From: pmchen (Peter M. Chen) Subject: ggraph is the culprit Regarding the recent crashes: Apparently ggraph, when given bad input (example file is in ~pmchen/simul/out/su_size_sy2/debug.gg) can crash a decstation (haven't tried it on a sun3 yet). To duplicate the crash: cd ~pmchen/simul/out/su_size_sy2 ggraph debug.gg 697. Date: Sun, 12 Nov 89 16:45:29 PST From: shirriff (Ken Shirriff) Subject: Makefile problem in /sprite/src/kernel/sprite If I do "pmake" in /sprite/src/kernel/sprite on the ds3100, it links a ds3100 kernel and then installs a sun3 kernel. If I do "pmake ds3100" it does the right thing. 698. Date: Sun, 12 Nov 89 17:45:46 PST From: brent (Brent Welch) Subject: Floating point on the DecStations John Hartman has mentioned that he thinks there is a race with the floating point unit when a trap is taken on the DecStation, which can result in a FPU trap in kernel mode. This is apparently the problem that Peter is having. I'm sending this because I'm not sure that John has posted a mail message about this. I do know that I've stopped running my floating point programs on the DecStations because they generate (NaN) every so often (divide by zero), and every so often they do this at the wrong time (time slice?) and cause their machine to panic. 699. Date: Sun, 12 Nov 89 17:52:03 PST From: tve (Thorsten von Eicken) Subject: problems I encountered when starting (a long time ago) on sprite I kept track of the major things I had problems with in the first weeks on sprite. Now that I've almost forgotten about the file, let me post it... Thorsten Instructions on how to boot machines.
"-f tftp()foo" "le/ie(foo,goo,bar)gulp" howto make machines boot automatically F1-key combinations F1-k to kill window system, F1-A TX want customizable mouse actions (which mouse actions start, extend selections) tx shows #lines-1 x #columns-1 when resizing window tx insists on using ^U and ^H as kill/delete. look at parent tty! Various clarify cross compilation possibilities (sun3/sun4/ds3100) & problems howto mount foreign nfs file systems manual pages out of date mkmf creates *.md directories only for the machine type it is running on 700. Date: Sun, 12 Nov 89 20:22:52 PST From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: Re: Floating point on the DecStations There is definately a floating point problem on the Ds3100. I suspect the problem is flushing the fp pipeline when entering the kernel. If one of the instructions generates an exception (NaN for example) then you get an exception while in the kernel. The code has to be smarter and understand when exceptions are allowed and when they aren't. I'll talk to Mike Nelson and see if we can come up with a simple solution. Unfortunately I don't have time to look into it right now. 701. Date: Fri, 10 Nov 89 12:34:31 PST From: Adam R de Boor <deboor@buddy.Berkeley.EDU> Subject: Re: more xkill info xkill issues an XKillClient using the window resource ID it gets back from the button press. The server makes no distinction between windows that are owned by the window manager and those that are owned by other clients. Since the XKillClient causes the server to forcibly shut down the connection between it and the client, there's nothing uwm can do if you click on an icon when running xkill (Fred's right: the icons are owned by the window manager). You will have to de-iconify the window and then run xkill. Sorry I never wrote a man page for the beast. It struck me as self-explanatory, but something describing foibles such as this would probably be a good thing.... 702. 
Date: Sun, 12 Nov 89 15:15:50 PST From: pmchen (Peter M. Chen) Subject: RpcDoCall Burble intercepted my broadcast for / in a reboot. I have no idea what this means, and everything proceeded hunky-dory after a 1 minute wait. This was on mustard 703. Date: Mon, 13 Nov 89 15:24:24 PST From: tve (Thorsten von Eicken) Subject: queue to lps40 hung mint.Berkeley.EDU: waiting for queue to be enabled on ginger Rank Owner Job Files Total Size 1st tve 115 (standard input) 50273 bytes 704. Date: Sat, 18 Nov 89 14:32:30 PST From: brent (Brent Welch) Subject: Change symbolic links to remote links My measurements indicate that symbolic links between domains cause the bulk of the pathname redirections. These can be eliminated by converting these cross-domain links to remote links and setting up the server to export them. This can be done on a live system by first exporting the prefix % prefix -x /tmp -l /c/tmp and then changing the symbolic link to a remote link. Remember to update the server's mount table with an entry like: Export /tmp /c/tmp Pathname redirections occur in about 15% of the lookups, although Mint has most of them, about 22% of its lookups bounce through a symbolic link. Overall 0.04% of the lookups bounce through a remote link, although Mint sees a lot of these, too, 0.48% up from 0.04% before ``/sprite/src'' was added. 
Here is the set of symbolic links in '/'
lrwxrwxrwx  1 root     5 Jun 29 09:16 X -> /b/X
lrwxrwxrwx  1 nelson   7 Jul 13 16:37 X11 -> /a/X11
lrwxrwxrwx  1 root    11 Oct 30 13:32 X11R3 -> /mic/X11R3
lrwxrwxrwx  1 root    12 Oct 26  1987 att -> /sprite/att
lrwxrwxrwx  1 root    13 Jan 22  1988 bin -> /sprite/cmds
lrwxrwxrwx  1 root     9 Aug 11 12:59 emacs -> /c/emacs
lrwxr-xr-x  1 root    12 Aug 10  1987 lib -> /sprite/lib
lrwxrwxrwx  1 root    13 Aug  7  1988 prob -> /test/rmprob
lrwxrwxrwx  1 root     8 Jul 11 11:21 raid -> /b/raid
lrwxrwxrwx  1 root    16 Jul 26 12:57 spare -> /rosemary/spare
lrwxrwxr-x  1 root    13 Jun 15  1988 swap -> /sprite/swap
lrwxrwxrwx  1 root     9 Nov 15 10:04 t88 -> /tmurder
lrwxrwxrwx  1 root    13 Oct 18 15:26 tftpboot -> /sprite/boot
lrwxrwxrwx  1 root    10 Aug 10 16:30 ultrix -> /c/ultrix
Note also that /swap is a link to /sprite/swap, and everything there is also a link. I know mint shouldn't swap to the root domain, but it seems like /swap could be changed back to a directory, and the link for mint could point to '/sprite/swap/32' while the others would be links to '/swap1/hostnum'. The link from /tftpboot to /sprite/boot is probably ok to leave, but all the bin directories should probably be exported so that mint is out of the loop. brent 705. Date: Mon, 13 Nov 89 18:23:54 PST From: brent (Brent Welch) Subject: System crash Mint was accidentally rebooted with an old kernel on Sunday night, and it died Monday afternoon with a known bug. Unfortunately, Oregano got confused sometime later and wedged things for a while. I debugged in and re-discovered an ugly problem I'd forgotten about. Somehow a Proc_ServerProc is leaving a handle locked and then going away. This quickly screws things up. It seems clearly related to recovery, so I'll spend some time looking at the code. brent 706. Date: Sat, 18 Nov 89 16:23:18 PST From: brent (Brent Welch) Subject: dump & bug status Murder was put into the debugger on Saturday afternoon, and it was part way through a dump at the time.
Could someone in 477 check up on this? It was in a recovery loop with Mint, but I accidentally killed mint trying to continue execution and catch the problem. I've found yet another bug in the Rpc_Daemon process, another subtle synchronization thing that showed up under load. There was also a deadlock on TimerMutex that I didn't understand. Two processes were in Timer_ScheduleRoutine. One was being interrupted so it must have been executing near the LOCK/UNLOCK code. Someone might verify that interrupts are being disabled soon enough in the LOCK_MONITOR macro so that the deadlock warning is correct. brent 707. Date: Sat, 18 Nov 89 16:56:17 PST From: pmchen (Peter M. Chen) Subject: mail had no "to" field The following mail had no "to" field in the header. I suspect it was being sent to Garth.
>From eklee Sat Nov 18 00:10:18 1989
>Received: by sprite.Berkeley.EDU (5.59/1.29)
> id AA797494; Sat, 18 Nov 89 00:10:18 PST
>Date: Sat, 18 Nov 89 00:10:18 PST
>From: eklee (Edward K. Lee)
>Message-Id: <8911180810.AA797494@sprite.Berkeley.EDU>
>Subject: Second order disk model
>Cc: pmchen
>Status: R
>
>Pete and I have "perfected" a second order disk model.
>This model takes as parameters, the number of cylinders, the step time,
>the average seek time, and the full stroke seek time.
>The model guarantees the values of the step, average and full stroke seek
>times to equal that of the parameters.
>We compared the amdahl drive characteristic to that predicted by the model
>and they were very very close.
>The model is in ~eklee/diskparam.
>
>Ed
>
708. Date: Sat, 18 Nov 89 16:58:22 PST From: pmchen (Peter M. Chen) Subject: long running job dies on apathy I have a script which dies on apathy, but not on any other machine. On apathy it dies with MachExceptionHandler: User bus error on ld or st The program can be run by: cd ~pmchen/simul go apathy (You probably have to be me to get the paths, etc. right). 709. Date: Tue, 14 Nov 89 12:29:35 PST From: pmchen (Peter M.
Chen) Subject: mustard.Berkeley.EDU: waiting for queue to be enabled on coriander This has caused me to be unable to print for the last couple hours (during which I've tried rebooting mustard, coriander, power cycling the printer, etc.). Any tips as to how to continue? This happens consistently when I print several jobs in a row.
mustard.Berkeley.EDU: waiting for queue to be enabled on coriander
Rank   Owner      Job  Files                        Total Size
1st    pmchen     317  /users/pmchen/reminders      1997 bytes
710. Date: Tue, 14 Nov 89 10:26:48 PST From: pmchen (Peter M. Chen) Subject: printing many things When printing many things, one after another, weird stuff happens with the print daemon. First it stalls and doesn't send to coriander (the unix machine which serves our printer). Then an lpq returns with:
mustard% lpq
mustard.Berkeley.EDU: Warning: no daemon present
Rank   Owner      Job  Files                        Total Size
1st    pmchen     286  (standard input)             16461 bytes
2nd    pmchen     287  (standard input)             14940 bytes
3rd    pmchen     288  (standard input)             13119 bytes
4th    pmchen     289  (standard input)             12937 bytes
no entries
We can fix things by rebooting coriander, but that's hardly a good long term solution. It's odd because coriander can still print fine. 711. Date: Tue, 14 Nov 89 12:41:56 PST From: tve (Thorsten von Eicken) Subject: printer, printer, printer, where are you?
[gluttony tve] lpq -Plps40
gluttony.Berkeley.EDU: waiting for queue to be enabled on ginger
Rank   Owner      Job  Files                        Total Size
1st    johnw      133  shifter.ps                   8573 bytes
2nd    johnw      134  shift_block.ps               12586 bytes
3rd    johnw      135  shifter.bdnet                2121 bytes
4th    johnw      136  shift_block.bdnet            480 bytes
5th    johnw      137  shift_block.bdnet            480 bytes
ginger.Berkeley.EDU: connection to ucbarpa is down
-------------------- at the same time --------------
ernie[tve] lpq -Plps40
Rank   Owner      Job  Files                        Total Size
active fisher     261  standard input               14058 bytes 0 bytes
2nd    gill       238  taduty                       1369 bytes
3rd    fisher     263  standard input               47549 bytes
4th    fisher     753  standard input               17739 bytes
5th    fisher     754  standard input               15685 bytes
712. Date: Tue, 14 Nov 89 10:33:56 PST From: brent (Brent Welch) Subject: -Ppulla restarted I was able to get the printer in Peter Chen's office going again by restarting the lpd process on sage. Their printer is called "pulla", by the way. 713. Date: Tue, 14 Nov 89 16:02:28 PST From: eklee (Edward K. Lee) Subject: possible pmake bug I was trying to run pmake in ~eklee/sim on a ds3100. Pmake complains about: #if (%(TM) == "ds3100") "local.mk", line 3: Warning: Malformed conditional (( %(TM) == "ds3100" )) but accepts: #if (%(TM) == "sun3") 714. Date: Wed, 15 Nov 89 13:42:56 PST From: jhh@sprite.Berkeley.EDU (John H. Hartman) Subject: ntalkd bug Ntalkd was in an infinite loop on hijack. It was also changing pids every so often. I put it into the debugger, but was unable to find an unstripped version of the binary, nor was I able to build a new binary. 715. Date: Wed, 15 Nov 89 11:52:33 PST From: gibson (Garth Gibson) Subject: pmake errors On basil VERSION 1.034 (sun3) (17 Oct 89 14:18:43) I saw these messages: <11>Nov 15 11:37:58 syslog: Db_Open: error opening file /sprite/admin/migInfo.new: permission denied. <11>Nov 15 11:37:58 syslog: Db_Open: error opening file /sprite/admin/migInfo.new: permission denied.
<11>Nov 15 11:38:37 syslog: Db_Open: error opening file /sprite/admin/migInfo.new: permission denied. <11>Nov 15 11:38:38 syslog: Db_Open: error opening file /sprite/admin/migInfo.new: permission denied. <11>Nov 15 11:38:51 syslog: Db_Open: error opening file /sprite/admin/migInfo.new: permission denied. <11>Nov 15 11:38:52 syslog: Db_Open: error opening file /sprite/admin/migInfo.new: permission denied. <11>Nov 15 11:39:27 syslog: Db_Open: error opening file /sprite/admin/migInfo.new: permission denied. <11>Nov 15 11:39:27 syslog: Db_Open: error opening file /sprite/admin/migInfo.new: permission denied. then my pmake hung with the message: basil 579> make --- sun3.md/cvscan.o --- rm -f sun3.md/cvscan.o cc -g -DNODATA -DTESTING=1 -DKERNEL=1 -L../raidlib -L../sim -g -O -msun3 -I/users/gibson/lib/include -I. -I. -Isun3.md -I../raidlib -I../sim -I../sim/sun3.md -I/sprite/src/kernel/dev -I/sprite/src/kernel/dev/sun3.md -I/sprite/src/kernel/Include -I/sprite/src/kernel/Include/sun3.md -c cvscan.c -o sun3.md/cvscan.o --- sun3.md/devDisk.o --- rm -f sun3.md/devDisk.o cc -g -DNODATA -DTESTING=1 -DKERNEL=1 -L../raidlib -L../sim -g -O -msun3 -I/users/gibson/lib/include -I. -I. -Isun3.md -I../raidlib -I../sim -I../sim/sun3.md -I/sprite/src/kernel/dev -I/sprite/src/kernel/dev/sun3.md -I/sprite/src/kernel/Include -I/sprite/src/kernel/Include/sun3.md -c devDisk.c -o sun3.md/devDisk.o make: Child (54f) not in table? 
ps tells me that make is in some form of infinite loop:
USER     PID  %CPU %MEM  SIZE  RSS STATE   TIME PR COMMAND
gibson b054e  75.0  3.3   424  272 READY   3:43    make
gibson 2050b  13.2 11.7  1648  960 READY 235:08    Xsprite :0
gibson 4051d   2.4  7.2   616  592 READY   8:29    tx =80x34+0-0
I did a Ctl-C on the make and it went into the debugger with the syslog msg: MachTrap: Bus error in user proc b054e, PC = f254, addr = 2e63207b BR Reg 2c020 the directory I am working in is ~gibson/RAID/sim.RAID/work i reran the make, got more Db_Open permission denied messages then make died with: basil 580> make make: Lockfile owned by you -- ignoring it --- sun3.md/mult.o --- rm -f sun3.md/mult.o cc -g -DNODATA -DTESTING=1 -DKERNEL=1 -L../raidlib -L../sim -g -O -msun3 -I/users/gibson/lib/include -I. -I. -Isun3.md -I../raidlib -I../sim -I../sim/sun3.md -I/sprite/src/kernel/dev -I/sprite/src/kernel/dev/sun3.md -I/sprite/src/kernel/Include -I/sprite/src/kernel/Include/sun3.md -c mult.c -o sun3.md/mult.o mult.c: In function mult: mult.c:43: warning: assignment of pointer from integer lacks a cast --- sun3.md/pseudoIO.o --- rm -f sun3.md/pseudoIO.o cc -g -DNODATA -DTESTING=1 -DKERNEL=1 -L../raidlib -L../sim -g -O -msun3 -I/users/gibson/lib/include -I. -I. -Isun3.md -I../raidlib -I../sim -I../sim/sun3.md -I/sprite/src/kernel/dev -I/sprite/src/kernel/dev/sun3.md -I/sprite/src/kernel/Include -I/sprite/src/kernel/Include/sun3.md -c pseudoIO.c -o sun3.md/pseudoIO.o Segmentation violation MachTrap: Bus error in user proc 53d, PC = 739e, addr = 2d672035 BR Reg 20 so I tried "pmake -x", got more Db_Open permission denied messages and another pmake: Child (d0539) not in table? 716. Date: Wed, 15 Nov 89 13:35:22 PST From: brent (Brent Welch) Subject: Proc_Lock race? Kvetching didn't quite make it out of recovery. I found that it was in Proc_WakeWaitingProcesses, stuck in Proc_Lock on an unused, unlocked process table entry. The condition variable was also zero, which means no one thought it was being waited on.
The lock information said the process table entry had been last locked by some other process that had also gone away by this time. It looks like there is some race between ProcFreePCB, Proc_Lock, and Proc_LockPID. Here is an abstract of each routine. Any ideas?

Proc_Lock(procPtr)
{
    LOCK_MONITOR;
    while (procPtr->genFlags & PROC_LOCKED) {
        (void) Sync_Wait(&procPtr->lockedCondition, FALSE);
    }
    procPtr->genFlags |= PROC_LOCKED;
    UNLOCK_MONITOR;
}

Proc_Unlock(procPtr)
{
    LOCK_MONITOR;
    procPtr->genFlags &= ~PROC_LOCKED;
    Sync_Broadcast(&procPtr->lockedCondition);
    UNLOCK_MONITOR;
}

ProcFreePCB(procPtr)
{
    LOCK_MONITOR;
    while (procPtr->genFlags & PROC_LOCKED) {
        (void) Sync_Wait(&procPtr->lockedCondition, FALSE);
    }
    procPtr->state = PROC_UNUSED;
    procPtr->genFlags = 0;
    UNLOCK_MONITOR;
}

Proc_LockPID(pid)
    Proc_PID pid;
{
    LOCK_MONITOR;
    procPtr = proc_PCBTable[Proc_PIDToIndex(pid)];
    while (TRUE) {
        if (procPtr->state == PROC_UNUSED || procPtr->state == PROC_DEAD) {
            procPtr = (Proc_ControlBlock *) NIL;
            break;
        }
        if (procPtr->genFlags & PROC_LOCKED) {
            do {
                (void) Sync_Wait(&procPtr->lockedCondition, FALSE);
            } while (procPtr->genFlags & PROC_LOCKED);
        } else {
            if (!Proc_ComparePIDs(procPtr->processID, pid)) {
                procPtr = (Proc_ControlBlock *) NIL;
            } else {
                procPtr->genFlags |= PROC_LOCKED;
            }
            break;
        }
    }
    UNLOCK_MONITOR;
    return(procPtr);
}

717. Date: Wed, 15 Nov 89 18:13:35 PST From: gibson (Garth Gibson) Subject: brk bug I was reading comp.os.mach and I saw this brk bug testing program (below). I compiled it on rosemary and ernie where it passes, but on basil and apathy it fails. As it fails to "free" user heap store and users do not often free heap store to the system, you may not care. garth

/*
** From: mcm@rti.UUCP (Mike Mitchell)
** Subject: Mach 2.5 bug
** Keywords: kernel expand(), PTE's
** Date: 16 Nov 89 00:35:56 GMT
** Organization: Research Triangle Institute, RTP, NC
**
** I have run into a problem with Mach 2.5. It is a problem that has been in
** BSD 4.X until BSD 4.3-Tahoe.
The fix is well understood for BSD systems,
** but I'm not sure how it fits into the Mach kernel.
**
** The problem is that memory pages are not returned properly when using the
** 'brk()' library routine to free them. More specifically, the PTE entries
** are not invalidated properly when shrinking a region. I can supply some
** diffs to fix the problem for BSD systems, but I've never seen Mach source.
**
** Anyway, try running the enclosed program. Please tell me if it works on
** your machine, and if so, what version of Mach and the type of CPU.
**
 * This program shows off a problem with the kernel's "expand()" routine.
 */
#include <signal.h>

main()
{
    char *old_break, *cp;
    int i;
    extern char *sbrk(), *brk();
    void segv();

    signal(SIGSEGV, segv);
    i = getpagesize();
    old_break = sbrk(0);            /* get the current "break" */
    (void) brk(old_break + 2*i);    /* bump it up 2 pages */
    cp = old_break + i + 256;
    *cp = 1;                        /* write into a new page */
    (void) brk(old_break);          /* release the memory */
    *cp = 2;                        /* write into the page again. This */
                                    /* time, you should get a sigsegv */
    printf("Your brk routine is broken!\n");
    exit(1);
}

void segv()
{
    printf("Your brk routine works correctly.\n");
    exit(0);
}

/*
** Mike Mitchell {decvax,seismo,ihnp4,philabs}!mcnc!rti!mcm mcm@rti.rti.org
**
** "If you hear me talking on the wind, You've got
** to understand, We must remain perfect strangers" (919) 541-6098
*/

718. Date: Wed, 15 Nov 89 18:22:22 PST From: brent (Brent Welch) Subject: Makefile broken in /sprite/src/kernel/sprite The Makefile in /sprite/src/kernel/sprite only works if a TM environment variable is set. I don't ordinarily set this. I got the following error messages before I figured that I should set TM.
"Makefile", line 28: Warning: Malformed conditional (!empty(TM))
"Makefile", line 30: #if-less #else
"Makefile", line 32: #if-less #endif
Fatal errors encountered -- cannot continue
719. 
Date: Wed, 15 Nov 89 19:00:18 PST From: brent (Brent Welch) Subject: Failed recovery I guess I have to take back my earlier complaints about page faults using up all the Proc_ServerProcs such that recovery is prevented. Sage failed to recover after Allspice rebooted, and I learned something by debugging it. The Proc_ServerProcs are not used at all! They were all available. There is some other reason that recovery doesn't kick in, and I haven't figured it out, yet. Also, I didn't find anybody stuck on the Proc_Lock, like what happened to Kvetching. Anyway, please let me know if your machine doesn't make it through recovery. I need to take another look at it. 720. Date: Wed, 15 Nov 89 19:05:24 PST From: tve (Thorsten von Eicken) Subject: /sprite/admin/howto/addNewHost I'm in the process of adding buzz (a sun3), here's what I'm encountering: #2. /etc/spritehosts is checked in (RCS) by mendel. I had to override. #3. /tftpboot is now on mint, not on ginger #3. the ndboot stuff seems to be bogus. The internet-address-file link is to sun3.md/netBoot (at least I think) #3. well, the whole stuff with the devices and so is bogus, isn't it? #4. it seems this step HAS gone away #5. the fsmakedev is unclear. What's the serialB business? It is not said that a dev directory has to be created in /hosts/foo, and that the syslog should go there. #7. what's this "export command for the root partition"? #10. /etc/hosts.equiv is checked out by jhh #10. what's the business of 'hostname' vs. 'hostname.Berkeley.EDU' in /etc/hosts.equiv? Ok, except that I can't find the netBoot for sun3's with a lance ethernet, it seems I got through... Thorsten 721. Date: Thu, 16 Nov 89 02:08:07 PST From: shirriff (Ken Shirriff) Subject: Allspice problems Just as I decided to go home tonight, allspice started spewing out consist reply errors on pride. I checked allspice and it had about 40 tftpd's in the debugger. I tried to debug one of the tftpd's from nutmeg, but nutmeg hung.
I tried to debug tftpd from allspice, but this seemed to upset mint, which started trying to do recovery with allspice and failed. Since I couldn't access anything from allspice, I couldn't do any debugging so I rebooted it. I then found that the ipServer on mint seemed to be in an infinite loop but I couldn't debug it because accessing /sprite/src/daemons needed to wait on allspice. At this point, mint was printing out heaps of messages. Allspice came back up and I left Bob to look at the ipServer. 722. Date: Thu, 16 Nov 89 08:25:07 PST From: brent (Brent Welch) Subject: RPC Ethernet Protocol I suspect that once again Sprite RPC is colliding with some other Ethernet protocol. While we changed our protocol number away from the XNS_IDP number (0x600), we now use (0x500), a nice round number that is probably used for some other protocol. All the messages about RPC version mismatch are probably due to this. The fact that the Sun4 net module doesn't recompile is also a problem, because the network interface gets reset after too many errors, and eventually this can tickle the bug where a sender gets hung. Allspice is still susceptible to this bug. I'll bet that's what happened last night. There were lots of complaints about bad RPC packets at oregano, and lots of trouble between it and allspice. 723. Date: Thu, 16 Nov 89 10:56:59 PST From: johnw (John Wawrzynek) Subject: TLB fault I have been experiencing the following when I use emacs rmail to respond to a message: Bad user TLB fault in process xxx: pc=4752e8 addr=646e6553 xxx is an emacs process. Thanks. 724. 
Date: Thu, 16 Nov 89 11:28:43 PST From: Fred Douglis <douglis> Subject: Re: hung walls => pseudo-device startup bug seems like you could preserve the blocking semantics, if you think they're desirable, with two fixes: first, if the server exits, go through and find any processes blocked on the pdev; and second, make the open call use some sort of callback so that the open doesn't get hung and is instead delayed and retried. That way it would be interruptable. It seems like all pdev-related RPCs should really be done in a way that the failure of a user-level process won't hang another process forever. spring cleaning item, maybe? Fred 725. Date: Thu, 16 Nov 89 11:11:58 PST From: ouster (John Ousterhout) Subject: Trashed mail file My mail inbox (/sprite/spool/mail/ouster) got trashed again today, but the symptoms lead me to believe it's sendmail that's doing the trashing. There are two messages in the mailbox where exactly one line (or perhaps less than a line?) got messed up. Here is the raw text from the inbox: From douglis Thu Nov 16 10:25:35 1989 Received: from garnet.Berkeley.EDU by sprite.Berkeley.EDU (5.59/1.29) id AA663622; Thu, 16 Nov 89 10:25:33 PST Received: by garnet.berkeley.edu (5.57/1.32) id AA25948; Thu, 16 Nov 89 08:50:54 PST Date: Thu, 16 Nov 89 08:50:54 PST From: c60b2-am@garnet.berkeley.edu (Kevin Gong) Message-Id: <8911161650.AA25948@garnet.berkeley.edu> To: c60b2-am@garnet.berkeley.edu, ouster@sprite.Berkeley.EDU Subject: Re: "value" vs. "machineCode" Well, it's in the homework description, but it's also in the skeleton for one or more of the programs (classify.c, and/or findIns.c) in the comments. 
- kevin From douglis Thu Nov 16 10:31:35 1989 Received: from janus.Berkeley.EDU by sprite.Berkeley.EDU (5.59/1.29) id AA401482; Thu, 16 Nov 89 10:31:31 PST Received: by janus.Berkeley.EDU (5.57/1.34) id AA00686; Thu, 16 Nov 89 08:33:54 PST Date: Thu, 16 Nov 89 08:33:54 PST From: ilp@janus.Berkeley.EDU (Shelley Sprandel) Message-Id: <8911161633.AA00686@janus.Berkeley.EDU> To: ouster@sprite.Berkeley.EDU Subject: ILP meeting & software Cc: ilp@janus.Berkeley.EDU, neureuth@esvax.berkeley.edu I talked with Andy Neureuther about having Cindy at the meeting. He feels it might be better not to have her there. I'll have copies of the list of faculty responses she's received. Andy also wants to know the status of several things: the Commerce Dept. GTDAs and the draft of the license to companies who want to use the software commercially. Unfortunately, Cindy has very bad carpal tunnel syndrome problems, has two doctor's appointments today, and won't be in. I'll try calling her at home. She comes in at 7:00 usually, so we should have the information in time for the meeting. -Shelley Notice that the messages appear to be perfectly well-formed except that the first "From" line in each message lists Fred as the sender instead of the real sender. These messages were consecutive in the mailbox. The cleanliness of the substitution makes me think it isn't a random file-system error that's doing it, but rather something in the mailer. I've saved a copy of the whole mailbox in ~ouster/mail.bad in case anyone wants to look at the bits in more detail. By the way, Fred, I suspect that two messages from you were lost. Can you resend them? -John- 726. Date: Thu, 16 Nov 89 11:15:51 PST From: Fred Douglis <douglis> Subject: Re: Trashed mail file mint was acting up before and i had to restart the ipServer and associated daemons. but before that, i found that a bunch of daemons weren't running, and i started sendmail by hand. i also ran "sendmail -q" to process the mail queue.
for some reason, mail delivered by that sendmail run came out as "From douglis" for both you and mary. sendmail is setuid, so i don't know why that would be. 727. Date: Thu, 16 Nov 89 11:22:52 PST From: Fred Douglis <douglis> Subject: uwm bug the new uwm in X11R3 apparently doesn't pass environment variables properly. I can't start programs from within uwm unless I specify a display on the command line. /X/cmds.ds3100/uwm works fine. I can start programs from my shell just fine. 728. Date: Thu, 16 Nov 89 11:24:01 PST From: brent (Brent Welch) Subject: Re: hung walls => pseudo-device startup bug Apparently I need to fix the pseudo-device implementation so open attempts by clients are denied, not blocked, if the server process hasn't fully started up. Currently there is some situation where rlogind creates a ``/hosts/foo/rloginN'' pseudo-device, forks a child, and exits without finishing its startup duties as a pseudo-device server. The child process hangs, and subsequent wall processes also hang, because they too are clients of the pseudo-device. 729. Date: Thu, 16 Nov 89 01:35:07 -0800 From: tve@ernie.Berkeley.EDU (Thorsten Von Eicken) Subject: Re: hung walls => pseudo-device startup bug I just looked at allspice's syslog: The world seems to be in endless recovery loops. Crackle thinks allspice is recovering every 30 seconds or so. Allspice has messages about mint and oregano recovering all the time. 730. Date: Thu, 16 Nov 89 14:03:40 PST From: Fred Douglis <douglis> Subject: trashed file I found a file with a bunch of nulls in it. Since the file is updated every 5 minutes, I can put a bound on when the problem occurred: after yesterday at 12:40 pm, and probably before yesterday at 1:25 pm. Assault did not reboot at that time or since. I put the file in /user2/BADFILES/mig-usage; it's a couple of megabytes so if no one is interested in it then it should be deleted. 731. 
Date: Thu, 16 Nov 89 14:38:31 PST From: douglis (Fred Douglis) Subject: recovery killed X after kvetching recovered, the X server just kept printing out "WaitForSomething() errno=22" over and over. 732. Date: Thu, 16 Nov 89 22:00:47 PST From: douglis@rosemary.Berkeley.EDU (Fred Douglis) Subject: mint hit consistency deadlock again mint wedged with the good old problem where lots of rpc servers backed up on a consistency-in-progress flag for host 18, whichever that is (is spritehosts stored on unix anywhere??).... when i tried to continue mint to see what i might learn, it died because as usual i forgot to say "pid 0" before continuing it. 733. Date: Thu, 16 Nov 89 21:02:35 PST From: david@rosemary.Berkeley.EDU (David A. Wood) Subject: Unstoppable pmake?? I have been having some problems with pmake getting in an unkillable state tonight. The process 'WAIT's right at the beginning. Perhaps for a filesystem?? In any case, it does not respond to a ^C or ^Z, nor can I kill it with kill -KILL. The system is fine; I can rlogin again, but I can't get any work done. 734. Date: Thu, 16 Nov 89 22:54:50 PST From: shirriff (Ken Shirriff) Subject: frexp? ldexp(...) defined in gnulib/ds3100.md/ldexp.c calls frexp(...) which doesn't seem to be defined anywhere for the ds3100, which means my compiles bomb with Undefined: frexp. (ldexp and frexp are defined in include/sun3.md for the sun3.) 735. Date: Fri, 17 Nov 89 08:49:18 PST From: brent (Brent Welch) Subject: trashed stat files I made another pass through all my data files and turned up a number of trashed ones. They all had the first 2 or 3 Kbytes zeroed out, which points to a fragment bug. I've been saving these in files named 'nuked.08:05:01.Z' (with the appropriate date stamp) if their original was 'rawstat.08:05:01.Z'. I'm leaving them in their original directory like this so I can get an idea of when they get trashed vs. when they were created. 
My current thoughts are that they get trashed shortly after they are generated, probably in the delayed write logic. I suspect that their cache block is being re-used too soon, or something. All of these files are under 3K, and either the first 2K are zero, or the whole thing is null. There is some sleight-of-hand done when a fragmented file has to be written out because it has to be realigned, and I'll bet that's broken. 736. Date: Fri, 17 Nov 89 15:32:42 PST From: ouster (John Ousterhout) Subject: Fred is sending a lot of mail All the mail I've received in about the last hour has come out with Fred as the sender (probably the same as a day or two ago, except it isn't going away). 737. Date: Fri, 17 Nov 89 15:50:19 PST From: Fred Douglis <douglis> Subject: Re: Fred is sending a lot of mail I think someone must have restarted mint's ipServer by hand (it wasn't a low process ID such as would be the case if it were the ipServer that started at boot-time). No sendmail background daemon was around. I started it as myself. Why sendmail persists in putting >From douglis at the start of each message is beyond me. I'll restart it as root and see how it goes. 738. Date: Sat, 18 Nov 89 20:52:22 PST From: tve (Thorsten von Eicken) Subject: /sprite/admin/responsibilities not many people have confessed up to now.... 739. Date: Sun, 19 Nov 89 01:11:13 PST From: tve (Thorsten von Eicken) Subject: need tsort program ... can't find it on Sprite ... 740. Date: Sun, 19 Nov 89 02:12:37 PST From: tve (Thorsten von Eicken) Subject: pmake problem I have problems with traditional makefiles (for make, not pmake). These makefiles (over 200!) are in the octtools distribution which I'm trying to compile on sprite. The problem is that the makefiles tend to construct very long lines to circumvent the problem that every script line runs in its own shell.
For example: > cleaninstall: > @echo "# %(MAKE) %@" > @for x in `%{MAKEORDER} %{TOOLS}`; do \ > echo "cd %$x ; %(MAKE) install clean" ; cd %$x ; \ > %(MAKE) %(MFLAGS) %(MAKEVARS) install clean ;\ > echo "cd .. # done in %$x (%@)" ; cd .. ; \ > done What happens is that the very long line (started by "for x..." in this example) gets truncated *silently* somewhere. Having tried several things, I suspect that it's pmake who clips the line. Could you please check and fix pmake? To test: "cd /mic/octtools/common/src; make cleaninstall". You should get a "/bin/sh: syntax error at line 1: `end of file' unexpected". Thorsten 741. Date: Mon, 20 Nov 89 11:01:26 PST From: pmchen (Peter M. Chen) Subject: lost mail I recently found out about some mail that did not get delivered to sprite. In the past, sprite has updated my /sprite/spool/mail/pmchen file incorrectly (mail is dropped in the middle of the file, etc.). I believe the mail was sent from touati@arpa (Herve, let me know if I'm wrong on that). I think handling mail incorrectly will be strong impetus to use unix, at least to send and receive mail. 742. Date: Mon, 20 Nov 89 13:02:18 PST From: mendel (Mendel Rosenblum) Subject: Fs_SetAttributes bug Doing an RCS "ci" on a symbolic link creates a file which ls thinks is a symbolic link. The problem occurs because Fs_SetAttributes doesn't check to see that the permission field contains a valid permission value. When "ci" chmod's the new RCS file it somehow ends up with the permission bits of 0xffffa16d rather than 0x16d. This causes Fs_SetAttributes to set the permission field of the file descriptor to 0xffffa16d. When ls does a stat() on this file the compatibility library does: unixAttsPtr->st_mode = spriteAttsPtr->permissions | CvtSpriteToUnixType(spriteAttsPtr->type); The extra bits in the spriteAttsPtr->permissions field are now or'ed into the type field, which causes the type field to become S_IFLNK.
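The bit arithmetic in this report can be checked with a small sketch. This is not the Sprite source: the type constants are the traditional Unix st_mode values (assumed here, not taken from Sprite headers), the names PERM_MASK, buggy_mode, and masked_mode are hypothetical, and the sketch assumes the new RCS file is a regular file.

```c
#include <stdint.h>

/* Traditional Unix st_mode type bits (assumed; not copied from Sprite headers). */
#define TYPE_MASK 0170000u  /* S_IFMT: bits that hold the file type */
#define TYPE_REG  0100000u  /* S_IFREG: regular file */
#define TYPE_LNK  0120000u  /* S_IFLNK: symbolic link */
#define PERM_MASK 0007777u  /* hypothetical: permission, setuid, setgid, sticky bits */

/* Mirrors the unmasked OR quoted in the report: stray high bits in the
 * permissions word leak into the type bits of the result. */
uint32_t buggy_mode(uint32_t permissions, uint32_t type)
{
    return permissions | type;
}

/* Hypothetical defensive version: discard everything outside the
 * permission bits before combining with the type. */
uint32_t masked_mode(uint32_t permissions, uint32_t type)
{
    return (permissions & PERM_MASK) | type;
}
```

With permissions 0xffffa16d and a regular-file type, the type bits of buggy_mode() equal TYPE_LNK, which matches the symptom ls showed, while masked_mode() keeps the regular-file type and leaves 0x16d as the permissions. Masking in the stat() conversion only hides the bad value; validating it in the system call, as the report suggests, would keep it out of the descriptor in the first place.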
I've fixed the compatibility library to handle bogus spriteAttsPtr->permissions fields. Someone should patch the hole in the Fs_SetAttributes syscall. 743. Date: Mon, 20 Nov 89 13:34:53 PST From: mendel (Mendel Rosenblum) Subject: Files not being dumped. If you create a directory structure such that the full pathnames of the files are more than 100 characters then the files will not be dumped by the Sprite dump program. The following files were not dumped by the last dump: /mic/octtools/common/lib/technology/scmos/msu/s150/mag/cs250_pads/SPUR_PADS/OCT_PADS/hgp/physical/contents; and /mic/octtools/common/lib/technology/scmos/msu/s150/mag/cs250_pads/SPUR_PADS/OCT_PADS/hgp/physical/interface; 744. Date: Mon, 20 Nov 89 13:42:32 PST From: Fred Douglis <douglis> Subject: another dump bug report the error messages mendel saw from the dump program were not mailed to me when I got the following message: >>>>> On Mon, 20 Nov 89 13:32:27 PST, root@sprite.Berkeley.EDU (The Sprite God) said: root> To: douglis root> Dump completed successfully. root> Level 1 dump on Mon Nov 20 11:46:26 1989 root> /user1 root> /user2 root> /sprite root> /sprite/src root> /sprite/src/kernel root> /mic root> /b root> /c root> / which means we could be hitting errors that Bob (or whoever is doing the dumps) never finds out about. Also, the dumps are not getting run automatically from murder's crontab -- I've had to do them by hand. And, the tape that's marked for yesterday's dump (11/19) ran okay at first, but when the dump program exited after the rlogin connection that had started it died, i got file mark errors on that tape and had to move on to the next tape. 745. Date: Mon, 20 Nov 89 16:49:43 PST From: Fred Douglis <douglis> Subject: ds3100 X server / keyboard problem after I tear down X, when typing at the console, I get a lot of bouncing -- many characters are echoed (and input) twice, especially if the shift key is held. 746.
Date: Mon, 20 Nov 89 21:09:47 PST From: brent (Brent Welch) Subject: awk loop on sun4 Awk went into an infinite loop on anise. The same thing works ok on a sun3. To repeat: cd ~brent/postrawstat/Results.cache awk -f AwkCacheClt mustard.sun3.jul-nov.var- 747. Date: Tue, 21 Nov 89 03:14:58 PST From: eklee (Edward K. Lee) Subject: files slaughtered Today, many of my files disappeared from sprite. Most of the missing files seem to be binaries but I'm not sure of that yet. The director ~eklee/cmds.md disappeared altogether. This is the second time that this particular directory has disappeared. Bob, I would appreciate it greatly if you could restore ~eklee/cmds.md as soon as possible (I need it to start up X). thanks, 748. Date: Tue, 21 Nov 89 10:02:27 PST From: Fred Douglis <douglis> Subject: *.fsc files The location of these files keeps changing, so they're hard to find. For example, there are *.fsc files dated July in /mintA/boot, and it wasn't until Mendel suggested that I look in /hosts/mint that I found the current ones. The thing is, they print "checking /dev/..." without any indication of the date, so correlating error messages with boottimes is hard. By the way... Ed has lots of files in /sprite/lost+found, but nothing all that recent, and mint rebooted a few days ago -- not in the past day. I'm still interested in hearing when the last time is that Ed's sure the directory and files did exist okay. 749. Date: Wed, 22 Nov 89 10:17:50 PST From: mendel (Mendel Rosenblum) Subject: sun4 register trash bug: low priority The sun4 window underflow handler trashes some user accessible registers it probably should not. For example, the following routine returns 1 on SunOS and some value like 503315628 on Sprite. .globl _foo _foo: save %sp, -96, %sp call CallDeepEnoughtToFlushWindows nop mov 1,%o1 ret restore %o1,%g0,%o0 The problem occurs when the restore causes a window underflow. The underflow handler trashes the %o1 register which is used when the restore is reissued. 
This is not a high priority problem because the C compiler never generates code using restore in this way. The library routine longjmp() does use restore in this way and so longjmp(jmp_buf,1) causes the setjmp() to return with 503315628 rather than 1. If this becomes a problem we can probably make long jump a few instructions longer and get around the problem. 750. Date: Wed, 22 Nov 89 10:20:11 PST From: Fred Douglis <douglis> Subject: dumps going from bad to worse I changed crontab to pipe the output of dailydump into "Mail douglis" and got the following message at the time it was run. I think the problem may be with cron rather than dumps; murder's ipserver died as i tried to investigate further so I can't say for sure yet. But here's the note: ------- Forwarded Message Date: Wed, 22 Nov 89 02:00:07 -0800 From: root@sprite.Berkeley.EDU (The Sprite God) To: douglis@sprite.Berkeley.EDU lost+found spriteCory .Xdefaults ------- End of Forwarded Message At the same time that the ipServer died, the dumps started up (from crontab again, as I was debugging it) -- but died a moment later with a "catastrophic formatting error" from the exabyte. After popping the tape out and putting it back in (to make sure the tape had rewound), I couldn't get a green light from the exabyte. We finally power-cycled the exabyte and it came back. 751. Date: Wed, 22 Nov 89 14:21:34 PST From: Fred Douglis <douglis> Subject: dist file prot bug: migInfo Mike had trouble getting migration to kick in down at WRL because /sprite/admin/migInfo had the wrong permissions. Someone complained that our copy up here temporarily had the wrong permissions too. In case the distribution isn't already set up to create this file with mode 0666, I figured I'd report this. 752. Date: Wed, 22 Nov 89 16:46:17 PST From: shirriff (Ken Shirriff) Subject: Tx bug I grep'd through a file that wasn't ascii and my tx window went into an infinite loop. 
I couldn't figure out what was wrong before dbx decided to complain about Illegal Instructions, so this will probably have to be filed away until it reoccurs. The problem seems to be in Sx_Notify line 291, where it is trying to figure out the notifier size by calling EndOfLine to take chunks of the line. EndOfLine is stepping through the string, but somehow Sx_Notify keeps starting over and processing the same string. 753. Date: Fri, 24 Nov 89 15:19:10 PST From: tve (Thorsten von Eicken) Subject: mailbox corrupted This time it's my mailbox which got affected. Fred's message about RCS'ed systemfiles landed in the middle of one of the messages I left in my mailbox. 754. Date: Sat, 25 Nov 89 11:12:56 PST From: pmchen (Peter M. Chen) Subject: mint is somewhat hosed Getting lots of <reopen> 11/25/89 11:11:51 mint (32) RPC timed-out 11/25/89 11:11:51 mint (32) Recovery failedrpc timeout Am unable to get to command user commands on clients. 755. Date: Sat, 25 Nov 89 11:43:17 PST From: shirriff (Ken Shirriff) Subject: Mail got trashed I got two mail messages merged together into one: Message 118: >From netlibd@surfer.EPM.ORNL.GOV Fri Nov 24 23:50:33 1989 Date: Sat, 25 Nov 89 02:50:10 -0500 From: netlibd@surfer.EPM.ORNL.GOV (Netlib) To: shirriff@sprite.Berkeley.EDU Subject: send linpackc from bench Sorry, no such library is available. Recheck the general index. Here are some example requests, in case syntax is the problem: send index send index for eispack send rg from eispack who is eric grosse Received: by sprite.Berkeley.EDU (5.59/1.29) id AA335964; Sat, 25 Nov 89 11:12:56 PST Date: Sat, 25 Nov 89 11:12:56 PST From: pmchen (Peter M. Chen) Message-Id: <8911251912.AA335964@sprite.Berkeley.EDU> To: bugs Subject: mint is somewhat hosed Getting lots of <reopen> 11/25/89 11:11:51 mint (32) RPC timed-out 11/25/89 11:11:51 mint (32) Recovery failedrpc timeout Am unable to get to command user commands on clients. 756. 
Date: Sun, 26 Nov 89 15:10:05 PST From: ouster (John Ousterhout) Subject: Allspice reboot I rebooted Allspice this afternoon. It was refusing to talk to Mace, even after I L1-N'ed it to reset its network interface and pinged Mace from Allspice. Allspice did seem to talk to just about everyone else, and strangely enough the act of preparing it to reboot managed to clear up the condition with Mace (I went back to my office halfway through the Allspice boot cycle and discovered that Mace was no longer hanging). 757. Date: Sun, 26 Nov 89 16:15:22 PST From: Fred Douglis <douglis> Subject: ds3100 duplicate memory free panic kvetching died sometime this morning with a message about freeing a block that was already free, but i was unable to attach to it to debug the corpse. this is just for the record, to see if it's a fluke or the start of a trend. 758. Date: Mon, 27 Nov 89 10:33:11 PST From: mendel (Mendel Rosenblum) Subject: missing man pages from dump and restore >From /sprite/admin/howto/restoreAFile: > 6. For more information see the manual entries for `dump' and `restore'. murder% man restore No manual entry for "restore". murder% man dump No manual entry for "dump". 759. Date: Mon, 27 Nov 89 13:17:28 PST From: Fred Douglis <douglis> Subject: inetd/login problem explained george taylor was told to run "/hosts/hijack/restartservers", which is a setuid shell script that starts up various daemons. that explains why the real userid was gibson (taylor didn't exist until just now) and why i never had any trouble suing and then restarting daemons. making it a setuid shell script also means that when sendmail is restarted it will probably think it was run by a mere mortal and would post mail as if it were "From <user>" instead of the real person sending the mail. In other words, mere mortals shouldn't have to restart servers themselves, but if they have to, it must be done with the real userID set to root. 760.
Date: Wed, 29 Nov 89 11:36:48 PST From: brent (Brent Welch) Subject: cross-loading Earlier I reported: ld: Bad machine type, not M_SPARC, for /usr/lib/libnet.a(Net_EtherAddrTo) when trying to make a sun4 kernel on sloth. This is because /usr/lib/libnet.a is a symbolic link to /sprite/lib/%MACHINE.md/libnet.a. This means that somehow only libc is special cased to work for cross-compiliation (cross-loading, actually.) Do we know this? Do we like it? 761. Date: Thu, 30 Nov 89 09:30:49 PST From: culler (David Culler) Subject: Dare I say, Ere SOSP I've encountered a couple of strange things on Sprite recently. (1) I sometimes lose typeout. It just stops echoing characters, although output from programs is displayed. This happens after I logout. It also seems to happen after running talk. In the second case, exiting the tx window and firing up a new one fixed it. In the other situation I had to reboot. (2) When an 'rsh' command is performed (I do this to print from Fennel) I get a message: "ioctl: Operation not supported on socket". The remote command does seem to take place, however. (3) The above situation arises because I can no longer get to lw533 from sprite. For awhile I could. Now lpq says, "Warning lw533 is down: sending to shallot". Unfortunately, nothing ever gets to shallot. (4) If I run dvi2ps on my machine and try to print the ps file, it looks rather impressionistic. Lots of interesting boxes, but few characters. Filtering the same dvi file through dvi2ps on fennel works fine. btw. Emacs still gets upset in trying to write files on Unix hosts. 762. Date: Thu, 30 Nov 89 10:36:35 PST From: ouster (John Ousterhout) Subject: fsattach man page This man page is a bit out-of-date. For example, it refers to "/local" in a few places. 763. Date: Thu, 30 Nov 89 13:51:55 PST From: mgbaker (Mary Gray Baker) Subject: cc1.68k cc1.68k goes into the debugger when run on a sparcstation trying to compile vmBoot.c for the sun3. 764. 
Date: Thu, 30 Nov 89 19:36:51 PST From: mgbaker (Mary Gray Baker) Subject: C library hash routine quite broken For Hash_CreateEntry, the test to see if an entry existed already was backwards. It should have been "if (!bcmp(...))" but was instead "if (bcmp(...))". What uses the C library hash routines? I know the kernel doesn't. 765. Date: Thu, 30 Nov 89 19:54:25 PST From: mendel (Mendel Rosenblum) Subject: Re: gdb on sun3 > gdb reports the stack as: > #0 0xe380 in Sig_Send () > #1 0x1b in ?? () > (gdb) > and I can't even see the stack frame of the main() routine. Try typing "si" and things will look better. Gdb is having trouble backtracing the stack after the Sig_Send syscall. The si causes it to execute the "addql #4,sp" instruction after the "trap #1" and put the stack in a format gdb can backtrace. 766. Date: Fri, 01 Dec 89 11:27:39 PST From: Fred Douglis <douglis> Subject: xhost % ls -l /X11R3/cmds.ds3100/xhost -rwxrwxr-x 1 stolcke 44 Nov 30 12:15 /X11R3/cmds.ds3100/xhost* % cat /X11R3/cmds.ds3100/xhost echo Access control buggy--no action taken. but if i try to run an x application on another host, i get an error, perhaps because that host (treason) isn't in /etc/X0.hosts. 767. Date: Sat, 2 Dec 89 13:30:28 PST From: shirriff (Ken Shirriff) Subject: Makefile for bib If I do a "make install" on bib, it puts the new bib in /users/shirriff/cmds.ds3100/bib instead of /sprite/cmds.ds3100/bib. Is there any reason for this? I moved the previous bib to /sprite/cmds.3100/bib.old and installed the new one myself, since the old installed bib hangs if it can't find a reference. 768. Date: Sat, 2 Dec 89 17:46:26 PST From: mendel (Mendel Rosenblum) Subject: sparcStation out-of-PMEGs bug Jaywalk hung on me when I ran a program that generated a large file. The reason it hung was it allocated almost all its PMEGS to the kernel and file system cache. This is the same problem we saw on allspice. 
Some time we need to patch the VM/filesystem not to wire down the PMEGs mapping the file cache. Until then we should limit the size of the file system cache on the sparc stations. 769. Date: Wed, 6 Dec 89 19:08:17 PST From: mendel (Mendel Rosenblum) Subject: sun3, sun4 allow *(char *)(-1) Both the sun3 and sun4 allow a user program to read a byte from the address 0xffffffff without an error. This is not true of the sun4c. 770. Date: Thu, 7 Dec 89 00:56:05 PST From: tve (Thorsten von Eicken) Subject: HELP mail seems flaky I know John Wawrzynek lost mail (he told me). I have a curiously empty mailbox, but I don't know whether I actually lost anything. Mint had tons of sendmail error messages on its console when we tried fixing it this afternoon from the out-of-processes state. Could someone please have a careful & thorough look into it? Is there any way to recover, or at least to get a list of the senders of lost messages? I do think this is important, some people are getting upset. 771. Date: Thu, 7 Dec 89 12:51:15 PST From: pmchen (Peter M. Chen) Subject: hard to send mail from arpa to sprite Herve Touati has consistently had problems in sending mail from ucbarpa to sprite. So far mail has been: 1) appended into the middle of my /sprite/spool/mail/pmchen file, 2) dropped totally, and 3) deferred: bad file number. He resent it: 772. Date: Thu, 7 Dec 89 16:17:29 PST From: shirriff (Ken Shirriff) Subject: gremlin bug If you start up gremlin "gremlin foo.grn" (where foo.grn is a gremlin file) and then hit undo inside gremlin, gremlin seg. faults on a sun3. 773. Date: Thu, 07 Dec 89 20:47:20 PST From: Fred Douglis <douglis> Subject: more sendmail problems mint's sendmail existed but was refusing connections since sometime around 5 or 6 today. mint's ipserver pid implies that perhaps it was restarted by hand. anyone know anything about this? anyway, i started a new sendmail. 
i can't debug the old sendmail since there's no unstripped binary -- i'll try to install a new one. 774. Date: Sat, 9 Dec 89 01:21:00 PST From: elm (ethan miller) Subject: problems with variables in Mail There are a bunch of variables, both set in .mailrc and environment, that seem not to show up in Mail. Among these are tabstr, prompt, and MBOX. The last, especially, creates a bit of a problem. This is occurring on a ds3100. No crashes, just a lack of some variables taking effect (they show up as set, but don't do anything). Does anyone know why this might be? 775. Date: Sun, 10 Dec 89 08:25:35 PST From: tve (Thorsten von Eicken) Subject: ntalkd doesn't link on sun4s --- sun4.md/ntalkd --- rm -f sun4.md/ntalkd cc -g -O -msun4 -Dsprite -Dsun4 -I. -Isun4.md -o sun4.md/ntalkd sun4.md/announce.o sun4.md/print.o sun4.md/process.o sun4.md/table.o sun4.md/talkd.o process.c:234: Undefined symbol _Ulog_GetAllLogins referenced from text segment 776. Date: Sun, 10 Dec 89 12:31:08 PST From: mendel (Mendel Rosenblum) Subject: Processes in NEW state on sparcstations Sparcstations seem to be collecting processes with a state of NEW. For example from jaywalk: jaywalk% ps -a | grep NEW 71223 NEW 0:00 sh -c /c/stats/RAW a1222 NEW 0:00 sh -c /c/stats/RAW 11225 NEW 0:00 mkdir jaywalk/10Dec 41224 NEW 0:00 test ! -d jaywalk/10Dec 2122d NEW 0:00 test 5 != 0 777. Date: Sun, 10 Dec 89 19:02:06 PST From: mendel (Mendel Rosenblum) Subject: sparcstation watchdog reset When I tried to attach an Xsprite that was in the debugger on jaywalk the machine got a watchdog reset. It looks like there is a bug in the code that handles window underflow traps with bad stack pointers. It appears to do the wrong thing when Proc_SuspendProcess returns. 778. Date: Sun, 10 Dec 89 21:32:31 PST From: mgbaker (Mary Gray Baker) Subject: Re: sparcstation watchdog reset The code in the window underflow stuff isn't supposed to handle anything further if the process has a bad stack pointer.
That's why I was originally calling ProcExitInt on those processes, to make sure nothing returned into the window underflow handler at that point. I switched to Proc_SuspendProcess so that we might be able to debug processes with bad stack pointers due to a suggestion from an optimist that Proc_SuspendProcess doesn't return. It causes a context switch, but I guess I'm confused about what happens when attaching to a process on the debug list. If it causes it to return from Proc_SuspendProcess into the underflow handler, then all hell will break loose and I will indeed need to do something more complicated than just calling Proc_SuspendProcess. I have 2 ideas of what to do, and I'll work on it. 779. Date: Mon, 11 Dec 89 16:49:49 PST From: pmchen (Peter M. Chen) Subject: lpr queuing again When I issue a printer job after not printing anything for a while, it gets stuck in the "sending to coriander" stage. To fix it, I can lprm the job and resend it, which usually works (sometimes I need to do it multiple times). This happens consistently. 780. Date: Tue, 12 Dec 89 12:22:05 PST From: douglis (Fred Douglis) Subject: restore failed! it finally reached /b after some intolerable period of time, and immediately went into the debugger -- perhaps because it tried to restore lost+found, which existed? there was no error message, just a statement that it was in the debugger. 781. Date: Tue, 12 Dec 89 12:52:14 PST From: mendel (Mendel Rosenblum) Subject: restore calls abort() during reload The code in tar for the "-n" flag is not documented in the man page, not listed in the "tar -help" list, and doesn't appear to work correctly. It seems to cause tar to delete all the files in a directory that are not on the dump tape. (Is this a good idea?) After doing the deletes it calls the routine usrrec() to skip over the tar records for the directory. This manages to mess up and call abort(). 782.
Date: Wed, 13 Dec 89 08:12:07 PST From: brent (Brent Welch) Subject: Long I/O waits Peter sent me the following two interesting messages about I/O behavior on Sprite. Date: Tue, 12 Dec 89 13:58:27 PST From: pmchen (Peter M. Chen) To: brent Subject: diff hangs Status: RO I was doing a diff of some VERY LARGE files (80 MB) and diff hung (didn't respond to ctrl-Z or ctrl-C). I also can't seem to kill the process (it's in the READY state). This has happened before. The files were /scratch/pmchen/db2.trace.11.20 and /scratch/db2.trace.11.20. Any ideas? Pete Date: Tue, 12 Dec 89 14:52:40 PST From: pmchen (Peter M. Chen) To: brent Subject: diff hanging The problem is repeatable. However, the process doesn't hang indefinitely, just about 5 minutes after which it returns. Pete It appears as if the diff process got on the end of a long I/O queue. Perhaps some other activity at the file server clogged up the disk. 783. Date: Wed, 13 Dec 89 16:58:26 PST From: Fred Douglis <douglis> Subject: can't mount /scratch2 I don't know how to mount /scratch2. It didn't come up automatically even though /hosts/anise/mount exists. running fsattach complained it didn't know where "mount" was. running it with an option to specify /hosts/anise/mount caused it to check the disk fine but complain it didn't know anything about /bootTmp. Seems like something funny is going on regarding anise not being set up with /bootTmp the way other machines are. I didn't see anything in the fsattach man page to explain it. 784. Date: Wed, 13 Dec 89 22:37:03 PST From: tve (Thorsten von Eicken) Subject: problem withmig on sun4s [crackle pmake] mig Error execing program: unknown error (0) Also, pmake doesn't actually seem to migrate anything? 785. Date: Thu, 14 Dec 89 11:49:58 PST From: Fred Douglis <douglis> Subject: /usr/lib why does ld look in /usr/lib instead of /sprite/lib/%TM.md? this came up in an earlier bug report about /usr/lib being a link and still is a problem. 786. 
Date: Thu, 14 Dec 89 12:46:20 PST From: Fred Douglis <douglis> Subject: mkmf incompatibilities control needed we need a way for Makefiles to check some sort of version number in the system makefiles they include. thorsten's problem, i believe, was due to Makefile not defining TM while script.mk expected it to be defined. perhaps this should be a spring-cleaning item? 787. Date: Thu, 14 Dec 89 12:48:17 PST From: brent (Brent Welch) Subject: double insert cache bug found I've finally found something wrong with the cache. Ironically it was my mousetrap routine that uncovered it, but not the way it was supposed to. The original mousetrap is in the blockWrite routine. It looks through the block index map to make sure the block seems like it belonged where it was being written. This causes extra uses of the indirect blocks, and this in turn exposed a "double insert" bug in Fscache_FetchBlock. If a cache block isn't found (say an indirect block), then Fscache_FetchBlock takes a block off the LRU list. This too might fail if the cache is full of dirty or in-use blocks. In this case Fscache_FetchBlock waits for room in the cache. The bug was that Fscache_FetchBlock didn't look in the hash table after it waited. (It only re-hashed if it first found the block but it was locked). It was possible for another process to load the indirect block into the cache, and then to have the original process wake up, take a block off the LRU list, and insert the block into the hash table again. Voila, double insertion, and the previous incarnation of the block was lost. In the case of the indirect block the machine crashes when the second instance of the block gets deleted because there is no longer an entry in the hash table; it was removed when the first instance of the block was removed. This bug might also explain the fragment bug because the block cache is used when growing a fragment. However, I am not positive of this.
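The double-insert race described above for Fscache_FetchBlock can be illustrated with a single-threaded toy model. This is only a sketch under stated assumptions, not the Sprite cache code: the struct, the function names, and the recheckAfterWait flag are all hypothetical, and the "other process" is simulated by having the wait routine itself insert the block.

```c
#define TABLE_SIZE 8

/* Toy stand-in for the cache hash table: a flat list of block numbers. */
struct cache {
    int blocks[TABLE_SIZE];  /* block numbers currently inserted */
    int count;               /* number of valid entries */
};

/* Number of entries for blockNum; more than one means a double insert. */
int occurrences(const struct cache *c, int blockNum)
{
    int n = 0;
    for (int i = 0; i < c->count; i++)
        if (c->blocks[i] == blockNum)
            n++;
    return n;
}

static void insert_block(struct cache *c, int blockNum)
{
    if (c->count < TABLE_SIZE)
        c->blocks[c->count++] = blockNum;
}

/* Simulates sleeping until the cache has room. While this caller is
 * "asleep", another process loads the same block into the cache. */
static void wait_for_room(struct cache *c, int blockNum)
{
    insert_block(c, blockNum);
}

/* Hypothetical fetch: with recheckAfterWait == 0 it behaves like the bug
 * in the report; with recheckAfterWait != 0 it looks the block up again
 * after waking and skips the second insert. */
void fetch_block(struct cache *c, int blockNum, int recheckAfterWait)
{
    if (occurrences(c, blockNum) > 0)
        return;                        /* already cached */
    wait_for_room(c, blockNum);        /* cache was full: sleep */
    if (recheckAfterWait && occurrences(c, blockNum) > 0)
        return;                        /* someone else loaded it meanwhile */
    insert_block(c, blockNum);         /* without the re-check: double insert */
}
```

Starting from an empty cache, fetch_block(&c, 42, 0) leaves two entries for block 42, while fetch_block(&c, 42, 1) leaves one. The general point is that any state examined before a sleep has to be examined again after waking.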
UpgradeFragment fetches the block containing the previous incarnation of the fragment. It then changes the disk address of the fragment and unlocks the cache block. If a double insert happened then the first incarnation of the fragment might either get lost, or it might linger around and do damage (not sure about this). Or, perhaps some other block gets doubly inserted and wreaks havoc. At any rate, the original mousetrap is still in, so if this doesn't fix it I may catch a block being written out to a place it doesn't belong. 788. Date: Thu, 14 Dec 89 17:33:29 PST From: tve (Thorsten von Eicken) Subject: is realloc man page correct? The man page says that realloc is compatible with old versions where one is allowed to realloc a block one has freed since the last call to malloc. I'm porting a program which uses that behaviour (sic!) and I get the message "Mem_Size: storage block is free". I also had a look into /sprite/src/lib/c/stdlib/Mem_Size.c and I don't see support for the compatibility. I think non-compatibility in this case is ok, but please fix the man page in that case. Or did I miss something? TvE (sorry, I don't have an easy example for the bug) 789. Date: Thu, 14 Dec 89 18:47:33 PST From: tve (Thorsten von Eicken) Subject: profiling doesn't work on ds3100 If I compile with -pg, I get an error at the final load: Can't open: /usr/lib/mcrt0.o1.31 (No such file or directory) Is that fixable? Or am I doing something wrong? 790. Date: Fri, 15 Dec 89 08:29:27 PST From: mendel (Mendel Rosenblum) Subject: New sun4c kernel still has the NEW process problem jaywalk% sysstat -v jaywalk SPRITE VERSION 1.046 (sun4c) (14 Dec 89 17:30:19) jaywalk% ps -a | grep NEW 11231 NEW -33901099:-50 11230 NEW 28917113:11 e1220 NEW 0:00 /users/mgbaker/cmds/screenscript -f ...
d1226 NEW 0:00 sort /tmp/temp725536 2121d NEW 0:00 xgone b122c NEW 0:00 la 120b NEW 0:00 xgone a1212 NEW 0:00 la f121e NEW 0:00 sh -c /users/mgbaker/cmds/screenscript a1227 NEW 0:00 sh -c echo SUMMARY `hostname` `date` 9120e NEW 0:00 sed -f /users/mgbaker/cmds/screenscript.sed c1221 NEW 0:00 sed -f /users/mgbaker/cmds/screenscript.sed 71238 WAIT 0:00 grep NEW 11232 NEW 0:00 jaywalk% 791. Date: Fri, 15 Dec 89 08:39:35 PST From: Fred Douglis <douglis> Subject: Re: New sun4c kernel still has the NEW process problem that problem won't go away until all sun4cs are running the new sun4c kernel. in fact, it may not go away until the sun4c mach module is changed to un-hold the migrate signal the way the other kernels do -- right now, the change handles exec-time migration but not other migration. since almost all migration is at exec time or is of processes that migrated earlier at exec time, it shouldn't be a problem, but it still has to be fixed. i talked to mary about this briefly -- i hesitated to put the change into the sun4c mach module because the sun4/4c trap code is radically different from the others. 792. Date: Fri, 15 Dec 89 15:30:32 PST From: Fred Douglis <douglis> Subject: new ds3100 X server hangs with certain Xdefaults with the following at the end of my .Xdefaults (loaded via xrdb), I can't talk to the server. If I comment it out I can start X windows up just fine. *Text.Translations: Ctrl<Key>W: delete-previous-word()\n Ctrl<Key>U: beginning-of-line() kill-to-end-of-line()\n Meta<Key>k: kill-selection()\n 793. Date: Fri, 15 Dec 89 17:26:58 PST From: Fred Douglis <douglis> Subject: sendmail/naming problem someone resent a note to me that bounced, with mint constantly trying to forward to itself: ----- Transcript of session follows ----- >>> DATA <<< 554 sendall: too many hops (17 max) 554 <douglis@@sprite.Berkeley.EDU>... 
Service unavailable: invalid argument ----- Unsent message follows ----- Received: from mint.Berkeley.EDU by sprite.Berkeley.EDU (5.59/1.29) id AA991307; Fri, 15 Dec 89 17:18:22 PST .... Received: by rosemary.Berkeley.EDU (4.0/SMI-4.0) id AA05426; Fri, 15 Dec 89 17:16:37 PST I haven't seen this before, and other mail appears to work okay. 794. Date: Sun, 17 Dec 89 11:54:02 PST From: mendel (Mendel Rosenblum) Subject: sun4c dies horrible death I tried to kill a process on the debug list and jaywalk went into an infinite loop scrambling the video. I had to power cycle to get control back. 795. Date: Sun, 17 Dec 89 15:24:14 PST From: mgbaker (Mary Gray Baker) Subject: Known sparcstation bugs with processes on debug list There are 2 known bugs about continuing processes on the debug list on sparcstations. They are related. In the installed new kernel, the call to Proc_SuspendProc is in the underflow handler for processes that have bad stack pointers. I've already mailed bugs about this problem. Continuing these processes is a very bad idea since the underflow handler can't deal any further with a process with a bad stack pointer. This was an attempt to make debugging of these processes possible, but obviously I must do this a little differently. The other related problem is that migrated processes aren't supposed to go onto the debug list, and I didn't know this before. If the Proc_SuspendProc gets called in the underflow handler on a migrated process, the machine will die in List_Remove. 796. Date: Mon, 18 Dec 89 00:42:31 PST From: shirriff (Ken Shirriff) Subject: eqn on sun3 is confused Eqn puts 3 blank lines after each line containing an equation, when used on a sun3. It works fine on the ds3100. 797. Date: Mon, 18 Dec 89 12:32:54 PST From: pmchen (Peter M. 
Chen) Subject: diff and cmp decstation (subversion) : diff shows them equivalent cmp shows them equivalent sun4 (anise): diff shows them DIFFERENT cmp shows them equivalent The files are /scratch/pmchen/db2.11.22.{a,b}. Watch out, they're big (80 MB). I am unable to kill (even -9) my diff process. I also can't ^C or ^Z it. Aaah! The unkillable process! :-0 798. Date: Mon, 18 Dec 89 14:49:35 PST From: brent (Brent Welch) Subject: 1.046 FsioVerifyBlockWrite broken & fixed The 1.046 kernel has a botched FsioVerifyBlockWrite routine. It tested ok on arson, and it has been running on oregano. However, the bug shows up on Sun4s, so Allspice and Anise had trouble running this kernel. The bug causes write attempts to fail because the Verify routine returns a bogus value. I've already fixed the code and am installing a new fsio module. 799. Date: Mon, 18 Dec 89 16:17:31 PST From: tve (Thorsten von Eicken) Subject: sun4 cc problem On the sun4, compiling for sun4, cc1.sparc goes into the debugger with MachPageFault: Bus error in user proc .... To duplicate: cd /cad/src/cmds/cifplot; pmake sun4.md/transforms.o It works fine on a sun3, compiling for sun4. 800. Date: Tue, 19 Dec 89 15:57:19 PST From: mgbaker (Mary Gray Baker) Subject: Error when running out of processes My machine ran out of processes, but the error it got first was that it had run out of segments. However, it died with an attempt to free something it thought was already free, namely the free(argString) call in DoExec in the execError section. I can't see why it thought this was already free. 801. Date: Tue, 19 Dec 89 17:11:06 PST From: mgbaker (Mary Gray Baker) Subject: treason not realizing it's idle When treason is idle but has X running on it, rup often fails to report it as idle. Although I haven't verified that this is really the cause of the problem, it's as if there are mouse events generated even when nobody is moving the mouse. This has ramifications for migration, etc. 802. 
Date: Thu, 21 Dec 89 10:07:30 PST
From: Fred Douglis <douglis>
Subject: still problems with swapping errors

when the net was acting up this morning i got an "error 2 from fs_read or fs_pageread" and my xwatch (xbiff) process died. when i tried to start a new one i hit "reserved instruction in ...". it seems like when there's a paging error on a code segment, the kernel isn't smart enough to nuke the segment and try again next time. last time this happened i had to copy the file into a new inode to get it to run.

803.
Date: Thu, 21 Dec 89 12:22:18 PST
From: douglis@rosemary.Berkeley.EDU (Fred Douglis)
Subject: timer mutex deadlock after reboot

mint rebooted just fine, though it had some odd complaints while checking /sprite that suggest the file system is on its way to getting trashed. when i left, it had printed a login prompt and machines were recovering. by the time i got back to 477, mint was in the debugger. it printed on its console that it was syncing its disks but didn't have an error message until below that point when it said that timerMutex was deadlocked. the holder PC and PCB were junk. i poked around in the debugger but couldn't find out where it was before that point, so i rebooted again and am crossing my fingers.

any ideas why the PC/PCB wouldn't be right? that's in all kernels, not just special ones, correct?

804.
Date: Fri, 22 Dec 89 08:32:18 PST
From: brent (Brent Welch)
Subject: sun4 (anise) X11R3 xinit dies

One problem with X11R3 concerns anise, the sun4/260. xinit goes into the debugger upon startup. There may be some fix, but I was not able to run X11R3 on anise because of this.

805.
Date: Fri, 22 Dec 89 12:42:08 PST
From: mendel (Mendel Rosenblum)
Subject: /X/cmds.sun4/Xsprite dies frequently

/X/cmds.sun4/Xsprite dies with much greater frequency (four times in the last couple of hours versus once a day) when the kernel grows over 6 megabytes. This might suggest that there is a bug in the sun4c VM operating with low numbers of pmegs and/or free memory pages. Don't bother to try to debug the Xsprite because you will get a watchdog reset every time.

806.
Date: Fri, 22 Dec 89 13:01:22 PST
From: mendel (Mendel Rosenblum)
Subject: minor bug in Mx

If you select a control-L and insert it into an Mx search window you get a small black rectangle rather than something representing a control-L. The search works (it finds the control-L's and not small black rectangles). Some control characters (such as control-G and control-F) come out as spaces in the search window.

807.
Date: Fri, 22 Dec 89 17:21:38 PST
From: pmchen (Peter M. Chen)
Subject: floating point on sun3

I've gotten some results that say (double) 49 / (double) 5030 * 1000.0 is 0.00. This only happens on the sun3; decstations and sun4s give the correct answer. I couldn't duplicate this in a simpler program, but you can see this by running ~pmchen/tmp/mult/t1 as me from ~pmchen/tmp/mult. The output file is in mult.out. Look for the line that says I/O's per second. The source is ~pmchen/raid/mult/mult.c

808.
Date: Sun, 31 Dec 89 11:03:18 PST
From: mendel (Mendel Rosenblum)
Subject: Xmfb for sparcstation bug

The X11R3 Xmfb seems to have trouble rendering small stipple-filled rectangles. This is why the racing stripes of the Sx toolkit are broken on jaywalk and the other black and white sparcstations. I tried to debug it, but the object files don't seem to match the source.

809.
Date: Sun, 31 Dec 89 15:27:45 PST
From: tve (Thorsten von Eicken)
Subject: cc dies on sun3 and sun4: can't compile!

try a pmake in /X11R3/src/cmds/xgraph, when it compiles xgraph.o cc1 either goes into debug or the next phase complains forever that

/tmp/cc079707.s:6841:End-of-File not at end of a line

810.
Date: Sun, 31 Dec 89 15:48:57 PST
From: mgbaker (Mary Gray Baker)
Subject: X11R3 color database still in trouble

Now my window with black background and red foreground comes out completely red rather than completely black. I agree this is more colorful, but it's equally impossible to use. Also, my light blue background has turned itself to purple.

811.
Date: Sun, 31 Dec 89 23:44:17 PST
From: tve (Thorsten von Eicken)
Subject: msgs doesn't seem to get updated

At least there are more recent messages on ernie.